That makes sense. But, I wonder if it would be simpler if the host code detected a “poor tap” and then “tapped again”.
It would seem to me there are two separate problems:
Figuring out where the bed is during initial homing. Precision needed to within a couple of mm. Distance travelled could be large, so speed of travel is noticeable.
Obtaining a precise bed position during probing. Precision needed to within a handful of microns. Distance travelled is only a few mm, so speed of travel is not important. External tubes and wires should not have a notable impact on measurements.
For what it is worth, programming in the micro-controller is painful and highly constrained. If you can move logic up to the host it can make the resulting system more maintainable.
If this was a non-contact sensor, I’d go for that.
When the machine homes it doesn’t know if the last print was 300mm high or 0.6mm high. Picking a high threshold for homing moves, potentially, subjects the toolhead to excessive force. Techniques like QGL are “in-between” a bed mesh probe and a homing move.
I don’t want to ship something that abuses people’s printers. These cases need the filter to keep the ultimate loading safe.
So it’s for safety & consistency.
Agreed, it’s much easier to debug and maintain. But it will have limits. Filtering above 5Hz probably won’t work. I don’t know yet if that’s going to be a problem. The Prusa source seems to hint that it is. But it’s not like they put out an academic paper explaining everything.
But, let’s say I start with something on the host… I’ll finally have a working printer again.
I actually agree with koconnor. Like Bambu Lab’s A1: homing is first a quick search for position, then 2 quick probes moving 1mm up and down. And no matter what kind of filter it is, higher accuracy also comes at the cost of increased computation time, and possibly needing more measurement data to feed the filter.
Well, the system could home with a regular (not high) threshold, and if the system false-triggers, then the host could detect that and restart the home. Thus, you don’t need to use a high threshold to home.
To be clear, this approach may be a terrible idea. I’m just pointing out possible alternatives. FWIW, sometimes I feel other common 3d printer firmwares “only have access to a single mcu” and thus “treat all problems as only being solvable by a single mcu”.
Okay. For what it is worth, I don’t understand what you are referring to by “5Hz”.
I hear what you are both saying about the homing move not needing to be high accuracy. I wholeheartedly agree. I’m not doing any tap analysis on homing moves. They have lower precision.
I’m worried about the safety of the homing moves: are they gentle “taps” or “crashes”?
I’m worried about consistency on intermediate probing distances for tasks like quad gantry leveling that want both accuracy and longer travel.
The code as it stands works fine for short distances. But if you dial up the threshold to cover long distance homing, it will “crash” the toolhead when homing from a short distance. Updating the tare value during the move solves this problem. That’s what a filter will do.
I’ll implement the filtering on the host with 10Hz updates of the tare value to the MCU. Then I can share some graphs of long vs short homing move with this feature on vs off. I think some pictures might be worth 1K words here.
Thank you, truly I appreciate the discussion. Working in a vacuum is not fun.
That could work. Imagining the printer repeatedly failing to home feels a bit “janky”. I’m certain I can do better than that with the filter.
A filter (or any other statistical technique) that’s updating the MCU from the Host can’t react faster than its update rate to the MCU. And I think it would be subject to the same law about filtering above the Nyquist frequency being impossible (i.e. 1/2 the update frequency). If we pull data from the sensor_bulk at 10Hz and update the tare value on the MCU at 10Hz, the highest frequency drift we can compensate for is 5Hz.
I can plot a frequency distribution graph (FFT) of the force data, before the tap, and see what the dominant frequency component is (this is from the workbook):
It’s that tall line near 0; that’s the drift signal.
Assuming competent engineering, I think Prusa gathered a large set of these FFT graphs for different probing speeds/locations and used them to pick a frequency cutoff that eliminates the drift signal in each printer. They picked 0.8Hz for the MK4 and 11.2Hz for the XL. My graphs from the Voron test rig look like they would be fine with the lower 0.8Hz cutoff of the MK4.
The XL has a higher cutoff frequency of 11.2Hz, but its toolhead is stationary when probing; it’s a moving-bed machine. So you would assume there is no drift, right? That’s not what their code is hinting at. Based on my tests, my guess is this has to do with the bowden tubes rocking back and forth at this higher frequency.
I’m a bit confused. From your previous descriptions, I thought a tap showed a measurement response that is an order of magnitude more pronounced than regular noise on the sensor. For example, a 200g force observed during a tap as opposed to ±20g of variation during movement.
So, in this example, if one sets the mcu to trigger at 50g, why does it matter if the sensor is fluctuating ±20g at 0.8Hz (or 11Hz or 100Hz)? The mcu won’t trigger on it, so it doesn’t matter? The host will have all the raw measurements anyway, and it can always analyze the tap using those raw measurements.
Admittedly, if the sensor is drifting over the long term, that could lead to some false positives in the mcu. For example, if the sensor is reporting ±20g of variation, starts with an absolute offset of 20g, but drifts to an absolute offset of 60g over 20 seconds of homing movement, then I could see where an absolute trigger point of 70g could be problematic. It’s certainly strange that the sensor would drift that much over 20 seconds though. (FWIW, though, the host could periodically update an absolute trigger point in the mcu to handle accumulating static offsets…)
So, I guess I’m confused on the overall magnitudes. Do you have some graphs of raw adc vs time measurements from long probing movements? FWIW, that might better help me understand the problem.
The blue line is long term force, x axis is in seconds. But in this graph the bowden tube was really loose to not torque with the sensor so I could get the test to complete. Imagine this graph, but it drifts by ~300g over 40s. (this was a 250mm probe at 5mm/s, simulating homing after a tall print)
That’s what I’ll do this weekend. I’ll get the graphs and report back.
Thanks. The graph (and description) helps. I think I better understand the problem. I agree there are some challenges in handling it.
Separately and not particularly related, I’d expect that most people using a load cell on the toolhead will also want to use a toolhead mcu board. Be aware that multi-mcu homing can introduce an additional 25ms delay between trsync trigger and stepper motor halt ( Multiple Micro-controller Homing and Probing - Klipper documentation ).
I implemented sending an updated tare value to the MCU when I get data from the sensor. I don’t have this tare value reflected in the data coming out of the web-socket, mainly because now there are 2 tare values: whatever the tare is at the start of the probe and the current “continuous tare” value. If they were the same value the graphs would look flat and that wouldn’t demonstrate the total absolute force change.
So these use the real force/time data, but the tare band is a post-probe re-creation using the same filter parameters. The continuous updates stop when the endstop triggers, so that’s why the tail end of the tare band is flat. But again, that’s a re-creation.
First is an example of a bowden tube that’s too short and pulls on the toolhead as it approaches the bed:
I tried this a couple of times and it triggered early. There is quite a big jump at the start that the filter is able to ignore, but it’s not able to update the endstop fast enough to prevent triggering. At 10mm/s this has far more overshoot (almost 400g).
On my printer I can stop the probe by tapping the bowden tube with my finger. On my Prusa XL I can get the same thing to happen, but it takes a larger amount of force. Basically they are rejecting more of this noise than I can with this periodic update method.
Here are some paths forward that I don’t like:
Ignore this problem entirely.
Use integer filter coefficients. I think this also runs into needing 64 bit multiplication. I’ve seen multiple warnings that the quantization can cause the filters to become unstable (tend to infinity) and they require individual analysis and testing. I don’t see a lot of promise in something that requires that much care from users.
Use the recursive Exponential Moving Average (EMA) instead. This is integer friendly but less sophisticated. It behaves like a low pass filter but it doesn’t have a sharp cutoff. It’s sensitive to probing speed in a way the butterworth filter is not (low speed probes look like drift). I crashed the toolchanger trying to get this to work early on in the project, so I’m wary. I think this page is a pretty good explanation.
Eliminate the load_cell_endstop and the sensor buffer on the MCU, send each measurement back immediately and do all the calculations on the Host. This eliminates all the clear performance advantages of buffering and intermittent processing on the host. This is what Creality did, but only at 80SPS, not the 320SPS of the HX717 (or higher on other chips).
Some ideas I do like:
Continuous updates from the Host to the MCU based on the filter. Clearly this works for some kinds of drift. I don’t have a good idea for improving this to achieve higher cutoff frequencies.
A hybrid where we offer a float math filter on the MCU if the chip is capable and updates from the Host if it’s not. This requires updates to firmware compilation and a new way to tell from the Host if the MCU has an FPU. Not sure how much work this would be.
Do fixed point math filters. We have 32 bit ADC outputs, so to prevent overflow we would need to multiply into a 64 bit result. If there is a good way to performance test it, I have code that we could try out. Each multiply op ends up being 3 x 64 bit multiplications, so I’m cautious about this actually being fast. Maybe a deeper dive will yield more optimizations. Edit: if I convert to grams, subtract the tare value, and limit the allowed drift to 2048g (that should be plenty!) I think I can keep it all in 32 x 32 multiplications.
Maybe you look at the above and see something I missed or you like something I’m not keen on.
I’m not sure I have any “good ideas”. I do have a few high-level thoughts; maybe they’ll spark further ideas.
For what it is worth, I still find the use of “filtering by frequencies” to be strange here. I’m guessing you are using the filters to weed out low frequency signals in an attempt to identify the “high frequency response” associated with a “tap”. However, from your graphs, it seems to me the tap response is linear, the bowden tube forces are mostly linear, and the measurement noise appears as gaussian noise. So, it seems strange to me to use frequency analysis in this domain.
It also seems strange to me that the homing code would detect a trigger on a “downward force” (as I think you were reporting in your second graph). It would seem to me that a “tap” would always be an “upward force” on the toolhead and so there would be no reason to check for a “downward force”.
Have you considered adding a simultaneous “failsafe” test using an absolute force minimum and maximum that is configured once during initial calibration? Something like “if the sensor ever reports a 1000g force then we need to stop”. That might offer some protection against overly aggressive “filtering code”.
Have you considered implementing an mcu based trigger test on the derivative of the sensor measurements? That is, using the difference between the current sensor measurement and the last sensor measurement. If a “tap” is detected by knowing that the force increases linearly according to the depressed distance (eg, 1000g per mm), then one could calculate the “tap response” as a function of time (eg, 1000g/mm * 10mm/s == 10000g/s), and then one could calculate the expected change per sample (eg, 10000g/s / 250samples/s == 40g/sample). One could then program this into the mcu as the trigger point. If the raw value is too susceptible to noise, a simple “moving average” calculating over just a few samples might clean that up (eg, this_diff = this_measurement - last_measurement; this_avg_diff = (last_avg_diff + this_diff) / 2; ).
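Sketching that in C (names and the trigger value are illustrative, taken from the worked example above, not from any real firmware):

```c
#include <stdint.h>

// Expected slope from the example above:
// 1000g/mm * 10mm/s = 10000g/s; 10000g/s / 250 samples/s = 40g/sample.
#define TRIGGER_DIFF 40  /* grams of force increase per sample */

static int32_t last_measurement;  /* previous sensor reading, in grams */
static int32_t avg_diff;          /* smoothed per-sample difference */

// Run once per ADC sample; returns 1 when the smoothed derivative
// exceeds the expected "tap" slope.
int check_tap_trigger(int32_t measurement)
{
    int32_t diff = measurement - last_measurement;
    last_measurement = measurement;
    // this_avg_diff = (last_avg_diff + this_diff) / 2
    avg_diff = (avg_diff + diff) / 2;
    return avg_diff >= TRIGGER_DIFF;
}
```

With these numbers, a steady ±20g of gaussian noise averages out to a near-zero derivative and never trips the 40g/sample threshold, while a real tap ramp does within a few samples.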
Maybe one way of looking at this differently is that the Exponential Moving Average (EMA) has a frequency response. The EMA is a low pass filter, it just doesn’t have good attenuation in the stop band or a sharp cutoff. Switching to the Butterworth filter gives more direct control over the pass band, more attenuation in the stop band and a sharper cutoff.
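For concreteness, the integer EMA in question is just a one-pole filter (an illustrative sketch, not the code I crashed the toolchanger with):

```c
#include <stdint.h>

// Integer EMA: avg += (sample - avg) >> K, i.e. alpha = 1/2^K.
// This is a one-pole low-pass; it only rolls off at ~6dB/octave,
// which is why it lacks the sharp cutoff of a Butterworth section.
#define EMA_SHIFT 4  /* alpha = 1/16 */

static int32_t ema_state;

int32_t ema_update(int32_t sample)
{
    ema_state += (sample - ema_state) >> EMA_SHIFT;
    return ema_state;
}
```

A step input converges toward the new level geometrically, so slow drift passes through largely unattenuated while fast changes are smoothed.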
Also, think of the collision in terms of an impulse response. It upsets the steady state of the filter because it looks like the beginning of a large high frequency wave. If the collision is linear forever then yes, the filter should eventually settle back to 0. But these Infinite Impulse Response (IIR) filters can take a long time to settle. The impulse response is what we are detecting, and that does work.
I’m like, 75% done with a fixed point C implementation of the SOS filter. I’m shooting for “fast” on a Cortex M0-M3 with the single cycle 32x32 multiply. So no up-conversion to 64 bit / no support for umull. I think this would work on most of the chips on the toolhead boards.
If the fixed point thing fails I’ll be looking into this next.
I had not thought about doing it on the MCU. I ran across this idea when I was researching knee/elbow finding algorithms. As you say, you probably also need a filter for noise (i.e. a low pass filter).
But that got me thinking, you’ve essentially suggested finding the elbow/knee on the MCU, so what about Kneedle? The Kneedle paper does demonstrate using it “online” to detect knees and they claim the latency is low: https://raghavan.usc.edu/papers/kneedle-simplex11.pdf
I’m skeptical of how much tuning & config either of these might require and how brittle they would be. The butterworth filter requires no re-tuning, Prusa shipped the same filter on every printer for all probing speeds. That kind of robustness is very friendly when users copy config files.
In the code I have been careful not to pick a preferred polarity. Checking both in the MCU is easy and improves safety. I’d like to add a “reverse” config option to the sensors so this can be flipped if desired instead of re-wiring. (All the Prusa hardware ships with the polarity in my graphs.)
Absolutely, I like that idea. Having a calibration tare reference in the config would add some safety margin. Maybe a default of +/- 2Kg would work?
I have a standalone C program (git gist) that uses Q12 format fixed point math to implement the second order sections filter. Here is some sample output:
It uses the “4-way” multiplication implementation, as mentioned above, so it won’t require 64 bit math or types. Most (but not all) of the ARM Cortex M0-3 cores currently shipping on toolhead boards have the single cycle 32x32 multiply instruction. It shouldn’t be hard (or expensive) to get vendors to use the right core in toolhead boards designed for load cell use. Also, you can use a filter with only 1 section to cut back on the CPU time if needed. (The sample uses 4 as a stress test.)
32 bit multiplication is used to convert from counts down to grams. We can send the appropriate parameters from the Host and the MCU won’t have to perform any division. If someone wants to use a sensor with an enormous number of counts per gram, it’s designed to just truncate the extra resolution with a bit shift. We don’t need any more resolution than 1/1000 of a gram.
I can do the trigger check in grams, instead of in counts, so the reverse operations are not needed.
Q12 will put an upper safe limit on the max force during a probe at 2^11=2048g. We just discussed having hard limits, so I think this is a good tradeoff, keeping as many bits as possible for the filter coefficients, which are mostly < 0. The limit will have to be enforced by checking the raw count values before they get converted so it won’t overflow.
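The core of one section looks roughly like this (a simplified direct-form-II-transposed biquad in Q12; a sketch, not the actual gist):

```c
#include <stdint.h>

#define Q 12  /* Q12 fixed point: 1.0 == 4096 */

struct sos_section {
    int32_t b0, b1, b2, a1, a2;  /* Q12 coefficients, a0 normalized to 1 */
    int32_t s1, s2;              /* delay-element state */
};

// One direct-form-II-transposed biquad step in Q12 fixed point.
// With |x| capped near 2^11 grams (the 2048g limit above), every
// coefficient product fits in a single 32x32->32 multiply.
static int32_t sos_step(struct sos_section *f, int32_t x)
{
    int32_t y = ((f->b0 * x) >> Q) + f->s1;
    f->s1 = ((f->b1 * x) >> Q) - ((f->a1 * y) >> Q) + f->s2;
    f->s2 = ((f->b2 * x) >> Q) - ((f->a2 * y) >> Q);
    return y;
}
```

Cascading N of these sections gives the order-2N Butterworth response; with b0 = 4096 and everything else zero a section is an identity pass-through, which is a handy sanity check.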
In this screenshot (from the websocket tool) the filter is on and set to trigger at 40g. I’m moving the bowden tube around with my hand to try and make it trigger, but it doesn’t care:
I have output() logs to verify that I’m not being fooled:
mcu 'mcu': #output: Filter Trigger at 44 grams
mcu 'mcu': #output: Filter Trigger at 93 grams
mcu 'mcu': #output: Filter Trigger at 123 grams
...
I also added the safety limit idea such that:
The load cell stores a reference_tare_count when it gets calibrated
The +/- safety limit is relative to that reference value
When homing it will shutdown() if the safety limit is breached. When not homing, it’s ignored.
When not homing or not using the filter, the endstop uses the existing static trigger threshold to decide if it’s triggered. I have this set to 100g now.
When homing with the filter it uses a new filter specific trigger threshold. This can be a smaller value e.g. 40g.
QUERY_ENDSTOPS still works if you push on the sensor because it’s not homing and uses the static threshold. It only uses the result from the filter while homing.
I also disconnected taring of the load cell from taring on the endstop. So the load cell can tare once at startup (using the reference_tare_count) and all measurements will forever be relative to that. If the user tares the load cell by command, it sends out an event. The endstop listens and updates its tare & reference based on the events. The endstop still does a tare before each homing command, but this is now local to the endstop.
I think this feels “right” because the load cell has a continuous chart of force while probing instead of repeatedly jumping back to 0.
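Roughly, the per-sample decision described above reads like this (a sketch with made-up names, not the actual code):

```c
#include <stdint.h>

// Illustrative endstop state; field names are hypothetical.
struct lc_endstop {
    int32_t reference_tare_count;  /* captured at calibration */
    int32_t safety_limit_counts;   /* +/- band around the reference */
    int32_t static_trigger_grams;  /* e.g. 100g, used when not homing */
    int32_t filter_trigger_grams;  /* e.g. 40g, used while homing */
    uint8_t homing;
};

// Returns -1 for a safety shutdown, 1 for a trigger, 0 otherwise.
int lc_endstop_check(struct lc_endstop *es, int32_t raw_count,
                     int32_t grams, int32_t filtered_grams)
{
    if (es->homing) {
        int32_t delta = raw_count - es->reference_tare_count;
        if (delta > es->safety_limit_counts
            || delta < -es->safety_limit_counts)
            return -1;  /* would call shutdown() on the MCU */
        // Polarity-agnostic: check both directions of the filtered force
        return filtered_grams >= es->filter_trigger_grams
            || filtered_grams <= -es->filter_trigger_grams;
    }
    // Not homing (e.g. QUERY_ENDSTOPS): use the static threshold
    return grams >= es->static_trigger_grams
        || grams <= -es->static_trigger_grams;
}
```
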
I ran the filtering code continuously on my MCU to check performance:
Without Filter:
Filtering Continuously @ 320Hz:
So that’s an increase of 0.16% MCU load and 0.08% awake time.
To be clear, this is the wrong chip to test on. It’s an STM32F446 / Arm Cortex M4 @ 180MHz. This core is really fast vs what we get on toolhead boards. But these numbers are so small that I think this is going to work out fine. Even if the M0+ ends up 3x slower it’s only a ~1% penalty.
(On the M0 non-plus core, int math is 32x worse, plus a frequency cut of 1/3 and this turns into more like a 15% penalty which is getting pretty large)
Separately, I found the CMSIS library that implements the same SOS filter/biquad algorithm. The library is intended to use the optional hardware DSP extensions, but has a fallback for chip variants that don’t have DSP instructions. The fallback for multiplication is type expansion to 64 bits. So the obvious question is: is that actually faster?
On the Cortex M4 it is faster, but it has a dedicated UMULL instruction that the M0+ lacks:
So this really needs testing on an M0 and M0+ to decide which version is better. This stack overflow question seems to suggest the long form version is faster on the M0+.
[quote=“koconnor, post:95, topic:9751, full:true”]
I think every ARM chip I’ve ever looked at had the single cycle multiply instruction. [/quote]
That’s not true and might be a very dangerous assumption to make. The timing varies quite a bit across ARM cores (here are the most popular used by STM32F MCUs):
If you’re performing the multiplication operations within a timed window you will, probably, be fine, but if you are counting instruction cycles (ie writing in assembly language), you’re going to have to research the operation of the selected multiply instruction.
I was referring to the ARM M0 and M0+ cores having an option for single cycle multiply or 32 cycle multiply. Every M0/M0+ chip I’ve looked at (eg, atsam c20, atsam c21, stm32f0, stm32g0, rp2040) implemented the single cycle multiply.
The thing I need to do is umull, which is emulated on the M0 (something like this) and takes ~34 clock cycles. It’s the same basic thing I was going to do in C; it ends up being 4 muls plus some adds. I’ll ship the simple implementation and assume the compiler does the right thing everywhere:
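Something like this (the "simple implementation", plus, for illustration, the schoolbook expansion the compiler/libgcc falls back to on cores without UMULL; the real patch may differ):

```c
#include <stdint.h>

// The simple version: let the compiler lower the widening multiply.
// On cores with UMULL this is one instruction; on Cortex-M0/M0+ the
// compiler emits a 4-mul schoolbook expansion instead.
uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

// The schoolbook expansion: four 16x16->32 partial products plus adds.
// (The 64-bit accumulation here is for clarity; the real libgcc helper
// tracks the carry in 32-bit registers.)
uint64_t mul32x32_schoolbook(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffff, ah = a >> 16;
    uint32_t bl = b & 0xffff, bh = b >> 16;
    uint64_t lo  = (uint64_t)(al * bl);
    uint64_t mid = (uint64_t)(al * bh) + (ah * bl); /* may carry past 32 bits */
    uint64_t hi  = (uint64_t)(ah * bh) << 32;
    return hi + (mid << 16) + lo;
}
```
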
On the old M0 cores where multiply is 32 cycles, this will be slow. But we don’t care about that core. No boards exist yet (except for the Creality one which doesn’t need to run this anyway)
FYI, the ARM Cortex-M0 has an option for single cycle 32bit MULS and all the chips we support (stm32f0) implement that single cycle 32bit MULS.
Cheers,
-Kevin
EDIT: Just to be clear, I understand you are doing a 64bit multiply and I also understand that is faster on Cortex-M3 (and later) chips. My suggestion is to “not worry about it”, as there is a lot of processing flexibility for code running in “task context”.
(I’ve started on the rebase for bulk_sensor so I squashed the main work branch and kept this other branch as a backup. Things were just getting too messy.)
Since we decided 64 bit multiply is OK, maybe I’ll do that in counts_to_grams(). Then I can check for the 2KG limit violation against the 64 bit result and remove the filter_counts_min & filter_counts_max parameters.
Here is a video of the test rig doing a homing and 2 QGL cycles on an intentionally skewed gantry from 10mm up with the filter enabled:
It couldn’t get through this reliably before the filter. Now it’s rock solid. I also get repeatable QGL results. Back-to-back QGL runs actually report a tolerance improvement based on the final adjustment (e.g. 0.006 → 0.0034).
With the Euclid probe in this machine I always had it report out of tolerance on the second QGL attempt. It would run 2 passes to re-adjust. It never got much below the 0.0075 threshold and it was doing 3 samples, so it had an advantage. I don’t know how to quantify how much better this is but this just feels like a reliable, repeatable instrument now.
QGL Logs:
11:07:29 $ QUAD_GANTRY_LEVEL
11:07:34 // probe at 50.000,25.000 is z=-0.608586
11:07:39 // probe at 50.000,225.000 is z=0.603268
11:07:45 // probe at 250.000,225.000 is z=0.471777
11:07:51 // probe at 250.000,25.000 is z=-0.936928
11:07:51 // Gantry-relative probe points:
// 0: 10.608586 1: 9.396732 2: 9.528223 3: 10.936928
11:07:51 // Actuator Positions:
// z: 10.621125 z1: 8.524312 z2: 8.500738 z3: 11.382986
11:07:51 // Average: 9.757290
11:07:51 // Making the following Z adjustments:
// stepper_z = -0.863835
// stepper_z1 = 1.232979
// stepper_z2 = 1.256552
// stepper_z3 = -1.625696
11:07:51 // Retries: 0/5 Probed points range: 1.540196 tolerance: 0.007500
11:07:58 // probe at 50.000,25.000 is z=0.284667
11:08:03 // probe at 50.000,225.000 is z=0.270929
11:08:09 // probe at 250.000,225.000 is z=0.255756
11:08:15 // probe at 250.000,25.000 is z=0.258762
11:08:15 // Gantry-relative probe points:
// 0: 9.715333 1: 9.729071 2: 9.744244 3: 9.741238
11:08:15 // Actuator Positions:
// z: 9.697648 z1: 9.734965 z2: 9.750489 z3: 9.755992
11:08:15 // Average: 9.734774
11:08:15 // Making the following Z adjustments:
// stepper_z = 0.037126
// stepper_z1 = -0.000191
// stepper_z2 = -0.015715
// stepper_z3 = -0.021219
11:08:15 // Retries: 1/5 Probed points range: 0.028911 tolerance: 0.007500
11:08:21 // probe at 50.000,25.000 is z=0.263198
11:08:26 // probe at 50.000,225.000 is z=0.267573
11:08:32 // probe at 250.000,225.000 is z=0.265907
11:08:38 // probe at 250.000,25.000 is z=0.261379
11:08:38 // Gantry-relative probe points:
// 0: 9.736802 1: 9.732427 2: 9.734093 3: 9.738621
11:08:38 // Actuator Positions:
// z: 9.736554 z1: 9.728399 z2: 9.731668 z3: 9.740427
11:08:38 // Average: 9.734262
11:08:38 // Making the following Z adjustments:
// stepper_z = -0.002292
// stepper_z1 = 0.005863
// stepper_z2 = 0.002594
// stepper_z3 = -0.006166
11:08:38 // Retries: 2/5 Probed points range: 0.006194 tolerance: 0.007500
11:08:49 $ QUAD_GANTRY_LEVEL
11:08:54 // probe at 50.000,25.000 is z=0.264376
11:09:00 // probe at 50.000,225.000 is z=0.262212
11:09:05 // probe at 250.000,225.000 is z=0.264595
11:09:11 // probe at 250.000,25.000 is z=0.265642
11:09:11 // Gantry-relative probe points:
// 0: 9.735624 1: 9.737788 2: 9.735405 3: 9.734358
11:09:11 // Actuator Positions:
// z: 9.735834 z1: 9.741112 z2: 9.734409 z3: 9.733586
11:09:11 // Average: 9.736235
11:09:11 // Making the following Z adjustments:
// stepper_z = 0.000401
// stepper_z1 = -0.004877
// stepper_z2 = 0.001826
// stepper_z3 = 0.002649
11:09:11 // Retries: 0/5 Probed points range: 0.003430 tolerance: 0.007500