*AI Summary*
### *Reviewer Group: ML Infrastructure (MLInfra) and Systems Architecture Specialists*
This topic is best reviewed by *Senior ML Infrastructure Architects and Distributed Systems Engineers.* These professionals are responsible for the orchestration, scaling, and cost-optimization of LLM and Diffusion model deployments. They focus on hardware utilization, latency-sensitive Service Level Objectives (SLOs), and the co-evolution of model architectures and system backends.
---
### *Abstract*
This technical presentation by Hao Zhang (UC San Diego) details the architectural paradigm shift in AI inference from 2025 into 2026. The core of the talk addresses the transition from "continuous batching" to "disaggregated prefill and decode (PD)" serving, which optimizes "goodput"—the measure of throughput that adheres to specific latency budgets (TTFT and TPOT).
The second half explores emerging frontiers: *Attention-FFN Disaggregation (AFD)* and *Video Diffusion (DIT).* AFD proposes splitting internal transformer modules to maximize utilization in Mixture-of-Experts (MoE) models, utilizing "ping-pong" pipelining to mask communication overhead. The discussion concludes with the systemic challenges of Video Diffusion Transformers, which require processing massive sequence lengths (115k+ tokens) across iterative diffusion steps, necessitating next-generation inference engines like "FastVideo" to move toward real-time 4K generation.
---
### *Inference Systems Evolution: Disaggregation and Video Diffusion*
* *0:00 – Introduction:* Hao Zhang (UCSD/DistServe) provides a roadmap for the talk, focusing on the 2025 trend of Prefill/Decode disaggregation and 2026 projections for internal module splitting and video workloads.
* *1:41 – The "Goodput" Metric:* Effective inference is defined not just by raw throughput, but by "goodput"—throughput that satisfies two primary SLOs:
* *TTFT (Time to First Token):* Critical for user experience in chatbots.
* *TPOT (Time per Output Token):* Critical for high-speed summarization and reading speed.
* *4:43 – Continuous Batching vs. Disaggregation:* Standard continuous batching suffers from interference; a new prefill request (compute-bound) can spike the latency of an ongoing decode request (memory-bound). Disaggregation eliminates this by moving requests between dedicated "Prefill" and "Decode" workers.
* *7:44 – Strategic Partitioning:* Disaggregation allows for "Divide and Conquer" optimization. Prefill instances can use *Tensor Parallelism* to minimize TTFT, while Decode instances utilize *Data Parallelism* and larger batch sizes to maximize TPOT.
* *9:17 – Case Study: 2P1D Allocation:* Profiling shows that allocating two prefill workers to one decoder worker (2P1D) can double the goodput per GPU compared to co-located systems by balancing the specific resource demands of the workload.
* *11:12 – The XPYD Equation:* The core challenge of modern inference is solving for placement (how many P vs. D units) and communication (efficient KV-cache transfer between heterogeneous hardware).
* *12:55 – Industry Milestones (2025):*
* *DeepSeek-V3:* Successfully embraced PD disaggregation with different parallelism strategies for each phase.
* *NVIDIA Dynamo:* The current state-of-the-art production implementation, featuring KV-aware routers, GPU planners, and low-latency transfer layers.
* *17:06 – Trend 1: Attention-FFN Disaggregation (AFD):* The next evolution involves splitting the attention module from the FFN/MoE module within a single layer. This is particularly effective for MoE models where expert parallelism can be scaled independently from attention replicas.
* *19:21 – The Ping-Pong Pipeline:* To mitigate the "scary" per-layer communication overhead of AFD, systems use fused communication (combining AFD moves with existing MoE all-to-all) and "ping-pong" pipelining to overlap micro-batch computation with hidden state transfers.
* *22:55 – Trend 2: Video Diffusion (DIT):* Video generation is currently prohibitively expensive (approx. $10/minute of video). Unlike LLMs, Diffusion Transformers (DIT) must run the same stack 50–100 times per generation across multiple diffusion timesteps.
* *25:50 – The 115k Token Challenge:* In models like Hunyuan Video, a 5-second 720p clip results in a sequence length of 115k tokens (a back-of-the-envelope calculation follows this list). Over 80% of compute time is spent on quadratic attention, making current single-GPU generation (16 minutes on an H100) impractical for production.
* *27:18 – FastVideo and Real-Time Goals:* The "FastVideo" engine aims to optimize attention kernels and memory layout to achieve real-time 1080p and 4K video generation in 2026 by converging diffusion techniques with large-scale language model inference architectures.
AI-generated summary created with gemini-3-flash-preview for free via RocketRecap-dot-com. (Input: 21,735 tokens, Output: 1,121 tokens, Est. cost: $0.0142).

Below, I will provide input for an example video (comprising title, description, and transcript, in this order) and the corresponding abstract and summary I expect. Afterward, I will provide a new transcript that I want summarized in the same format.
**Please give an abstract of the transcript and then summarize the transcript in a self-contained bullet list format.** Include starting timestamps, important details and key takeaways.
Example Input:
Fluidigm Polaris Part 2- illuminator and camera
mikeselectricstuff
131K subscribers
5,857 views Aug 26, 2024
Fluidigm Polaris part 1 : • Fluidigm Polaris (Part 1) - Biotech g...
Ebay listings: https://www.ebay.co.uk/usr/mikeselect...
Merch https://mikeselectricstuff.creator-sp...
40 Comments
@robertwatsonbath
6 hours ago
Thanks Mike. Ooof! - with the level of bodgery going on around 15:48 I think shame would have made me do a board re spin, out of my own pocket if I had to.
1
Reply
@Muonium1
9 hours ago
The green LED looks different from the others and uses phosphor conversion because of the "green gap" problem where green InGaN emitters suffer efficiency droop at high currents. Phosphide based emitters don't start becoming efficient until around 600nm so also can't be used for high power green emitters. See the paper and plot by Matthias Auf der Maur in his 2015 paper on alloy fluctuations in InGaN as the cause of reduced external quantum efficiency at longer (green) wavelengths.
4
Reply
1 reply
@tafsirnahian669
10 hours ago (edited)
Can this be used as an astrophotography camera?
Reply
mikeselectricstuff
·
1 reply
@mikeselectricstuff
6 hours ago
Yes, but may need a shutter to avoid light during readout
Reply
@2010craggy
11 hours ago
Narrowband filters we use in Astronomy (Astrophotography) are sided- they work best passing light in one direction so I guess the arrows on the filter frames indicate which way round to install them in the filter wheel.
1
Reply
@vitukz
12 hours ago
A mate with Channel @extractions&ire could use it
2
Reply
@RobertGallop
19 hours ago
That LED module says it can go up to 28 amps!!! 21 amps for 100%. You should see what it does at 20 amps!
Reply
@Prophes0r
19 hours ago
I had an "Oh SHIT!" moment when I realized that the weird trapezoidal shape of that light guide was for keystone correction of the light source.
Very clever.
6
Reply
@OneBiOzZ
20 hours ago
given the cost of the CCD you think they could have run another PCB for it
9
Reply
@tekvax01
21 hours ago
$20 thousand dollars per minute of run time!
1
Reply
@tekvax01
22 hours ago
"We spared no expense!" John Hammond Jurassic Park.
*(that's why this thing costs the same as a 50-seat Greyhound Bus coach!)
Reply
@florianf4257
22 hours ago
The smearing on the image could be due to the fact that you don't use a shutter, so you see brighter stripes under bright areas of the image as you still iluminate these pixels while the sensor data ist shifted out towards the top. I experienced this effect back at university with a LN-Cooled CCD for Spectroscopy. The stripes disapeared as soon as you used the shutter instead of disabling it in the open position (but fokussing at 100ms integration time and continuous readout with a focal plane shutter isn't much fun).
12
Reply
mikeselectricstuff
·
1 reply
@mikeselectricstuff
12 hours ago
I didn't think of that, but makes sense
2
Reply
@douro20
22 hours ago (edited)
The red LED reminds me of one from Roithner Lasertechnik. I have a Symbol 2D scanner which uses two very bright LEDs from that company, one red and one red-orange. The red-orange is behind a lens which focuses it into an extremely narrow beam.
1
Reply
@RicoElectrico
23 hours ago
PFG is Pulse Flush Gate according to the datasheet.
Reply
@dcallan812
23 hours ago
Very interesting. 2x
Reply
@littleboot_
1 day ago
Cool interesting device
Reply
@dav1dbone
1 day ago
I've stripped large projectors, looks similar, wonder if some of those castings are a magnesium alloy?
Reply
@kevywevvy8833
1 day ago
ironic that some of those Phlatlight modules are used in some of the cheapest disco lights.
1
Reply
1 reply
@bill6255
1 day ago
Great vid - gets right into subject in title, its packed with information, wraps up quickly. Should get a YT award! imho
3
Reply
@JAKOB1977
1 day ago (edited)
The whole sensor module incl. a 5 grand 50mpix sensor for 49 £.. highest bid atm
Though also a limited CCD sensor, but for the right buyer its a steal at these relative low sums.
Architecture Full Frame CCD (Square Pixels)
Total Number of Pixels 8304 (H) × 6220 (V) = 51.6 Mp
Number of Effective Pixels 8208 (H) × 6164 (V) = 50.5 Mp
Number of Active Pixels 8176 (H) × 6132 (V) = 50.1 Mp
Pixel Size 6.0 m (H) × 6.0 m (V)
Active Image Size 49.1 mm (H) × 36.8 mm (V)
61.3 mm (Diagonal),
645 1.1x Optical Format
Aspect Ratio 4:3
Horizontal Outputs 4
Saturation Signal 40.3 ke−
Output Sensitivity 31 V/e−
Quantum Efficiency
KAF−50100−CAA
KAF−50100−AAA
KAF−50100−ABA (with Lens)
22%, 22%, 16% (Peak R, G, B)
25%
62%
Read Noise (f = 18 MHz) 12.5 e−
Dark Signal (T = 60°C) 42 pA/cm2
Dark Current Doubling Temperature 5.7°C
Dynamic Range (f = 18 MHz) 70.2 dB
Estimated Linear Dynamic Range
(f = 18 MHz)
69.3 dB
Charge Transfer Efficiency
Horizontal
Vertical
0.999995
0.999999
Blooming Protection
(4 ms Exposure Time)
800X Saturation Exposure
Maximum Date Rate 18 MHz
Package Ceramic PGA
Cover Glass MAR Coated, 2 Sides or
Clear Glass
Features
• TRUESENSE Transparent Gate Electrode
for High Sensitivity
• Ultra-High Resolution
• Broad Dynamic Range
• Low Noise Architecture
• Large Active Imaging Area
Applications
• Digitization
• Mapping/Aerial
• Photography
• Scientific
Thx for the tear down Mike, always a joy
Reply
@martinalooksatthings
1 day ago
15:49 that is some great bodging on of caps, they really didn't want to respin that PCB huh
8
Reply
@RhythmGamer
1 day ago
Was depressed today and then a new mike video dropped and now I’m genuinely happy to get my tear down fix
1
Reply
@dine9093
1 day ago (edited)
Did you transfrom into Mr Blobby for a moment there?
2
Reply
@NickNorton
1 day ago
Thanks Mike. Your videos are always interesting.
5
Reply
@KeritechElectronics
1 day ago
Heavy optics indeed... Spare no expense, cost no object. Splendid build quality. The CCD is a thing of beauty!
1
Reply
@YSoreil
1 day ago
The pricing on that sensor is about right, I looked in to these many years ago when they were still in production since it's the only large sensor you could actually buy. Really cool to see one in the wild.
2
Reply
@snik2pl
1 day ago
That leds look like from led projector
Reply
@vincei4252
1 day ago
TDI = Time Domain Integration ?
1
Reply
@wolpumba4099
1 day ago (edited)
Maybe the camera should not be illuminated during readout.
From the datasheet of the sensor (Onsemi): saturation 40300 electrons, read noise 12.5 electrons per pixel @ 18MHz (quite bad). quantum efficiency 62% (if it has micro lenses), frame rate 1 Hz. lateral overflow drain to prevent blooming protects against 800x (factor increases linearly with exposure time) saturation exposure (32e6 electrons per pixel at 4ms exposure time), microlens has +/- 20 degree acceptance angle
i guess it would be good for astrophotography
4
Reply
@txm100
1 day ago (edited)
Babe wake up a new mikeselectricstuff has dropped!
9
Reply
@vincei4252
1 day ago
That looks like a finger-lakes filter wheel, however, for astronomy they'd never use such a large stepper.
1
Reply
@MRooodddvvv
1 day ago
yaaaaay ! more overcomplicated optical stuff !
4
Reply
1 reply
@NoPegs
1 day ago
He lives!
11
Reply
1 reply
Transcript
0:00
so I've stripped all the bits of the
0:01
optical system so basically we've got
0:03
the uh the camera
0:05
itself which is mounted on this uh very
0:09
complex
0:10
adjustment thing which obviously to set
0:13
you the various tilt and uh alignment
0:15
stuff then there's two of these massive
0:18
lenses I've taken one of these apart I
0:20
think there's something like about eight
0:22
or nine Optical elements in here these
0:25
don't seem to do a great deal in terms
0:26
of electr magnification they're obiously
0:28
just about getting the image to where it
0:29
uh where it needs to be just so that
0:33
goes like that then this Optical block I
0:36
originally thought this was made of some
0:37
s crazy heavy material but it's just
0:39
really the sum of all these Optical bits
0:41
are just ridiculously heavy those lenses
0:43
are about 4 kilos each and then there's
0:45
this very heavy very solid um piece that
0:47
goes in the middle and this is so this
0:49
is the filter wheel assembly with a
0:51
hilariously oversized steper
0:53
motor driving this wheel with these very
0:57
large narrow band filters so we've got
1:00
various different shades of uh
1:03
filters there five Al together that
1:06
one's actually just showing up a silver
1:07
that's actually a a red but fairly low
1:10
transmission orangey red blue green
1:15
there's an excess cover on this side so
1:16
the filters can be accessed and changed
1:19
without taking anything else apart even
1:21
this is like ridiculous it's like solid
1:23
aluminium this is just basically a cover
1:25
the actual wavelengths of these are um
1:27
488 525 570 630 and 700 NM not sure what
1:32
the suffix on that perhaps that's the uh
1:34
the width of the spectral line say these
1:37
are very narrow band filters most of
1:39
them are you very little light through
1:41
so it's still very tight narrow band to
1:43
match the um fluoresence of the dies
1:45
they're using in the biochemical process
1:48
and obviously to reject the light that's
1:49
being fired at it from that Illuminator
1:51
box and then there's a there's a second
1:53
one of these lenses then the actual sort
1:55
of samples below that so uh very serious
1:58
amount of very uh chunky heavy Optics
2:01
okay let's take a look at this light
2:02
source made by company Lumen Dynamics
2:04
who are now part of
2:06
excelitas self-contained unit power
2:08
connector USB and this which one of the
2:11
Cable Bundle said was a TTL interface
2:14
USB wasn't used in uh the fluid
2:17
application output here and I think this
2:19
is an input for um light feedback I
2:21
don't if it's regulated or just a measur
2:23
measurement facility and the uh fiber
2:27
assembly
2:29
Square Inlet there and then there's two
2:32
outputs which have uh lens assemblies
2:35
and this small one which goes back into
2:37
that small Port just Loops out of here
2:40
straight back in So on this side we've
2:42
got the electronics which look pretty
2:44
straightforward we've got a bit of power
2:45
supply stuff over here and we've got
2:48
separate drivers for each wavelength now
2:50
interesting this is clearly been very
2:52
specifically made for this application
2:54
you I was half expecting like say some
2:56
generic drivers that could be used for a
2:58
number of different things but actually
3:00
literally specified the exact wavelength
3:02
on the PCB there is provision here for
3:04
385 NM which isn't populated but this is
3:07
clearly been designed very specifically
3:09
so these four drivers look the same but
3:10
then there's two higher power ones for
3:12
575 and
3:14
520 a slightly bigger heat sink on this
3:16
575 section there a p 24 which is
3:20
providing USB interface USB isolator the
3:23
USB interface just presents as a comport
3:26
I did have a quick look but I didn't
3:27
actually get anything sensible um I did
3:29
dump the Pi code out and there's a few
3:31
you a few sort of commands that you
3:32
could see in text but I didn't actually
3:34
manage to get it working properly I
3:36
found some software for related version
3:38
but it didn't seem to want to talk to it
3:39
but um I say that wasn't used for the
3:41
original application it might be quite
3:42
interesting to get try and get the Run
3:44
hours count out of it and the TTL
3:46
interface looks fairly straightforward
3:48
we've got positions for six opto
3:50
isolators but only five five are
3:52
installed so that corresponds with the
3:54
unused thing so I think this hopefully
3:56
should be as simple as just providing a
3:57
ttrl signal for each color to uh enable
4:00
it a big heat sink here which is there I
4:03
think there's like a big S of metal
4:04
plate through the middle of this that
4:05
all the leads are mounted on the other
4:07
side so this is heat sinking it with a
4:09
air flow from a uh just a fan in here
4:13
obviously don't have the air flow
4:14
anywhere near the Optics so conduction
4:17
cool through to this plate that's then
4:18
uh air cooled got some pots which are
4:21
presumably power
4:22
adjustments okay let's take a look at
4:24
the other side which is uh much more
4:27
interesting see we've got some uh very
4:31
uh neatly Twisted cable assemblies there
4:35
a bunch of leads so we've got one here
4:37
475 up here 430 NM 630 575 and 520
4:44
filters and dcro mirrors a quick way to
4:48
see what's white is if we just shine
4:49
some white light through
4:51
here not sure how it is is to see on the
4:54
camera but shining white light we do
4:55
actually get a bit of red a bit of blue
4:57
some yellow here so the obstacle path
5:00
575 it goes sort of here bounces off
5:03
this mirror and goes out the 520 goes
5:07
sort of down here across here and up
5:09
there 630 goes basically straight
5:13
through
5:15
430 goes across there down there along
5:17
there and the 475 goes down here and
5:20
left this is the light sensing thing
5:22
think here there's just a um I think
5:24
there a photo diode or other sensor
5:26
haven't actually taken that off and
5:28
everything's fixed down to this chunk of
5:31
aluminium which acts as the heat
5:32
spreader that then conducts the heat to
5:33
the back side for the heat
5:35
sink and the actual lead packages all
5:38
look fairly similar except for this one
5:41
on the 575 which looks quite a bit more
5:44
substantial big spay
5:46
Terminals and the interface for this
5:48
turned out to be extremely simple it's
5:50
literally a 5V TTL level to enable each
5:54
color doesn't seem to be any tensity
5:56
control but there are some additional
5:58
pins on that connector that weren't used
5:59
in the through time thing so maybe
6:01
there's some extra lines that control
6:02
that I couldn't find any data on this uh
6:05
unit and the um their current product
6:07
range is quite significantly different
6:09
so we've got the uh blue these
6:13
might may well be saturating the camera
6:16
so they might look a bit weird so that's
6:17
the 430
6:18
blue the 575
6:24
yellow uh
6:26
475 light blue
6:29
the uh 520
6:31
green and the uh 630 red now one
6:36
interesting thing I noticed for the
6:39
575 it's actually it's actually using a
6:42
white lead and then filtering it rather
6:44
than using all the other ones are using
6:46
leads which are the fundamental colors
6:47
but uh this is actually doing white and
6:50
it's a combination of this filter and
6:52
the dichroic mirrors that are turning to
6:55
Yellow if we take the filter out and a
6:57
lot of the a lot of the um blue content
7:00
is going this way the red is going
7:02
straight through these two mirrors so
7:05
this is clearly not reflecting much of
7:08
that so we end up with the yellow coming
7:10
out of uh out of there which is a fairly
7:14
light yellow color which you don't
7:16
really see from high intensity leads so
7:19
that's clearly why they've used the
7:20
white to uh do this power consumption of
7:23
the white is pretty high so going up to
7:25
about 2 and 1 half amps on that color
7:27
whereas most of the other colors are
7:28
only drawing half an amp or so at 24
7:30
volts the uh the green is up to about
7:32
1.2 but say this thing is uh much
7:35
brighter and if you actually run all the
7:38
colors at the same time you get a fairly
7:41
reasonable um looking white coming out
7:43
of it and one thing you might just be
7:45
out to notice is there is some sort
7:46
color banding around here that's not
7:49
getting uh everything s completely
7:51
concentric and I think that's where this
7:53
fiber optic thing comes
7:58
in I'll
8:00
get a couple of Fairly accurately shaped
8:04
very sort of uniform color and looking
8:06
at What's um inside here we've basically
8:09
just got this Square Rod so this is
8:12
clearly yeah the lights just bouncing
8:13
off all the all the various sides to um
8:16
get a nice uniform illumination uh this
8:19
back bit looks like it's all potted so
8:21
nothing I really do to get in there I
8:24
think this is fiber so I have come
8:26
across um cables like this which are
8:27
liquid fill but just looking through the
8:30
end of this it's probably a bit hard to
8:31
see it does look like there fiber ends
8:34
going going on there and so there's this
8:36
feedback thing which is just obviously
8:39
compensating for the any light losses
8:41
through here to get an accurate
8:43
representation of uh the light that's
8:45
been launched out of these two
8:47
fibers and you see uh
8:49
these have got this sort of trapezium
8:54
shape light guides again it's like a
8:56
sort of acrylic or glass light guide
9:00
guess projected just to make the right
9:03
rectangular
9:04
shape and look at this Center assembly
9:07
um the light output doesn't uh change
9:10
whether you feed this in or not so it's
9:11
clear not doing any internal Clos Loop
9:14
control obviously there may well be some
9:16
facility for it to do that but it's not
9:17
being used in this
9:19
application and so this output just
9:21
produces a voltage on the uh outle
9:24
connector proportional to the amount of
9:26
light that's present so there's a little
9:28
diffuser in the back there
9:30
and then there's just some kind of uh
9:33
Optical sensor looks like a
9:35
chip looking at the lead it's a very
9:37
small package on the PCB with this lens
9:40
assembly over the top and these look
9:43
like they're actually on a copper
9:44
Metalized PCB for maximum thermal
9:47
performance and yeah it's a very small
9:49
package looks like it's a ceramic
9:51
package and there's a thermister there
9:53
for temperature monitoring this is the
9:56
475 blue one this is the 520 need to
9:59
Green which is uh rather different OB
10:02
it's a much bigger D with lots of bond
10:04
wise but also this looks like it's using
10:05
a phosphor if I shine a blue light at it
10:08
lights up green so this is actually a
10:10
phosphor conversion green lead which
10:12
I've I've come across before they want
10:15
that specific wavelength so they may be
10:17
easier to tune a phosphor than tune the
10:20
um semiconductor material to get the uh
10:23
right right wavelength from the lead
10:24
directly uh red 630 similar size to the
10:28
blue one or does seem to have a uh a
10:31
lens on top of it there is a sort of red
10:33
coloring to
10:35
the die but that doesn't appear to be
10:38
fluorescent as far as I can
10:39
tell and the white one again a little
10:41
bit different sort of much higher
10:43
current
10:46
connectors a makeer name on that
10:48
connector flot light not sure if that's
10:52
the connector or the lead
10:54
itself and obviously with the phosphor
10:56
and I'd imagine that phosphor may well
10:58
be tuned to get the maximum to the uh 5
11:01
cenm and actually this white one looks
11:04
like a St fairly standard product I just
11:06
found it in Mouse made by luminous
11:09
devices in fact actually I think all
11:11
these are based on various luminous
11:13
devices modules and they're you take
11:17
looks like they taking the nearest
11:18
wavelength and then just using these
11:19
filters to clean it up to get a precise
11:22
uh spectral line out of it so quite a
11:25
nice neat and um extreme
11:30
bright light source uh sure I've got any
11:33
particular use for it so I think this
11:35
might end up on
11:36
eBay but uh very pretty to look out and
11:40
without the uh risk of burning your eyes
11:43
out like you do with lasers so I thought
11:45
it would be interesting to try and
11:46
figure out the runtime of this things
11:48
like this we usually keep some sort
11:49
record of runtime cuz leads degrade over
11:51
time I couldn't get any software to work
11:52
through the USB face but then had a
11:54
thought probably going to be writing the
11:55
runtime periodically to the e s prom so
11:58
I just just scope up that and noticed it
12:00
was doing right every 5 minutes so I
12:02
just ran it for a while periodically
12:04
reading the E squ I just held the pick
12:05
in in reset and um put clip over to read
12:07
the square prom and found it was writing
12:10
one location per color every 5 minutes
12:12
so if one color was on it would write
12:14
that location every 5 minutes and just
12:16
increment it by one so after doing a few
12:18
tests with different colors of different
12:19
time periods it looked extremely
12:21
straightforward it's like a four bite
12:22
count for each color looking at the
12:24
original data that was in it all the
12:26
colors apart from Green were reading
12:28
zero and the green was reading four
12:30
indicating a total 20 minutes run time
12:32
ever if it was turned on run for a short
12:34
time then turned off that might not have
12:36
been counted but even so indicates this
12:37
thing wasn't used a great deal the whole
12:40
s process of doing a run can be several
12:42
hours but it'll only be doing probably
12:43
the Imaging at the end of that so you
12:46
wouldn't expect to be running for a long
12:47
time but say a single color for 20
12:50
minutes over its whole lifetime does
12:52
seem a little bit on the low side okay
12:55
let's look at the camera un fortunately
12:57
I managed to not record any sound when I
12:58
did this it's also a couple of months
13:00
ago so there's going to be a few details
13:02
that I've forgotten so I'm just going to
13:04
dub this over the original footage so um
13:07
take the lid off see this massive great
13:10
heat sink so this is a pel cool camera
13:12
we've got this blower fan producing a
13:14
fair amount of air flow through
13:16
it the connector here there's the ccds
13:19
mounted on the board on the
13:24
right this unplugs so we've got a bit of
13:27
power supply stuff on here
13:29
USB interface I think that's the Cyprus
13:32
microcontroller High speeded USB
13:34
interface there's a zyink spon fpga some
13:40
RAM and there's a couple of ATD
13:42
converters can't quite read what those
13:45
those are but anal
13:47
devices um little bit of bodgery around
13:51
here extra decoupling obviously they
13:53
have having some noise issues this is
13:55
around the ram chip quite a lot of extra
13:57
capacitors been added there
13:59
uh there's a couple of amplifiers prior
14:01
to the HD converter buffers or Andor
14:05
amplifiers taking the CCD
14:08
signal um bit more power spy stuff here
14:11
this is probably all to do with
14:12
generating the various CCD bias voltages
14:14
they uh need quite a lot of exotic
14:18
voltages next board down is just a
14:20
shield and an interconnect
14:24
boardly shielding the power supply stuff
14:26
from some the more sensitive an log
14:28
stuff
14:31
and this is the bottom board which is
14:32
just all power supply
14:34
stuff as you can see tons of capacitors
14:37
or Transformer in
14:42
there and this is the CCD which is a uh
14:47
very impressive thing this is a kf50 100
14:50
originally by true sense then codec
14:53
there ON
14:54
Semiconductor it's 50 megapixels uh the
14:58
only price I could find was this one
15:00
5,000 bucks and the architecture you can
15:03
see there actually two separate halves
15:04
which explains the Dual AZ converters
15:06
and two amplifiers it's literally split
15:08
down the middle and duplicated so it's
15:10
outputting two streams in parallel just
15:13
to keep the bandwidth sensible and it's
15:15
got this amazing um diffraction effects
15:18
it's got micro lenses over the pixel so
15:20
there's there's a bit more Optics going
15:22
on than on a normal
15:25
sensor few more bodges on the CCD board
15:28
including this wire which isn't really
15:29
tacked down very well which is a bit uh
15:32
bit of a mess quite a few bits around
15:34
this board where they've uh tacked
15:36
various bits on which is not super
15:38
impressive looks like CCD drivers on the
15:40
left with those 3 ohm um damping
15:43
resistors on the
15:47
output get a few more little bodges
15:50
around here some of
15:52
the and there's this separator the
15:54
silica gel to keep the moisture down but
15:56
there's this separator that actually
15:58
appears to be cut from piece of
15:59
antistatic
16:04
bag and this sort of thermal block on
16:06
top of this stack of three pel Cola
16:12
modules so as with any Stacks they get
16:16
um larger as they go back towards the
16:18
heat sink because each P's got to not
16:20
only take the heat from the previous but
16:21
also the waste heat which is quite
16:27
significant you see a little temperature
16:29
sensor here that copper block which
16:32
makes contact with the back of the
16:37
CCD and this's the back of the
16:40
pelas this then contacts the heat sink
16:44
on the uh rear there a few thermal pads
16:46
as well for some of the other power
16:47
components on this
16:51
PCB okay I've connected this uh camera
16:54
up I found some drivers on the disc that
16:56
seem to work under Windows 7 couldn't
16:58
get to install under Windows 11 though
17:01
um in the absence of any sort of lens or
17:03
being bothered to the proper amount I've
17:04
just put some f over it and put a little
17:06
pin in there to make a pinhole lens and
17:08
software gives a few options I'm not
17:11
entirely sure what all these are there's
17:12
obviously a clock frequency 22 MHz low
17:15
gain and with PFG no idea what that is
17:19
something something game programmable
17:20
Something game perhaps ver exposure
17:23
types I think focus is just like a
17:25
continuous grab until you tell it to
17:27
stop not entirely sure all these options
17:30
are obviously exposure time uh triggers
17:33
there ex external hardware trigger inut
17:35
you just trigger using a um thing on
17:37
screen so the resolution is 8176 by
17:40
6132 and you can actually bin those
17:42
where you combine multiple pixels to get
17:46
increased gain at the expense of lower
17:48
resolution down this is a 10sec exposure
17:51
obviously of the pin hole it's very uh
17:53
intensitive so we just stand still now
17:56
downloading it there's the uh exposure
17:59
so when it's
18:01
um there's a little status thing down
18:03
here so that tells you the um exposure
18:07
[Applause]
18:09
time it's this is just it
18:15
downloading um it is quite I'm seeing
18:18
quite a lot like smearing I think that I
18:20
don't know whether that's just due to
18:21
pixels overloading or something else I
18:24
mean yeah it's not it's not um out of
18:26
the question that there's something not
18:27
totally right about this camera
18:28
certainly was bodge wise on there um I
18:31
don't I'd imagine a camera like this
18:32
it's got a fairly narrow range of
18:34
intensities that it's happy with I'm not
18:36
going to spend a great deal of time on
18:38
this if you're interested in this camera
18:40
maybe for astronomy or something and
18:42
happy to sort of take the risk of it may
18:44
not be uh perfect I'll um I think I'll
18:47
stick this on eBay along with the
18:48
Illuminator I'll put a link down in the
18:50
description to the listing take your
18:52
chances to grab a bargain so for example
18:54
here we see this vertical streaking so
18:56
I'm not sure how normal that is this is
18:58
on fairly bright scene looking out the
19:02
window if I cut the exposure time down
19:04
on that it's now 1 second
19:07
exposure again most of the image
19:09
disappears again this is looks like it's
19:11
possibly over still overloading here go
19:14
that go down to say say quarter a
19:16
second so again I think there might be
19:19
some Auto gain control going on here um
19:21
this is with the PFG option let's try
19:23
turning that off and see what
19:25
happens so I'm not sure this is actually
19:27
more streaking or which just it's
19:29
cranked up the gain all the dis display
19:31
gray scale to show what um you know the
19:33
range of things that it's captured
19:36
there's one of one of 12 things in the
19:38
software there's um you can see of you
19:40
can't seem to read out the temperature
19:42
of the pelta cooler but you can set the
19:44
temperature and if you said it's a
19:46
different temperature you see the power
19:48
consumption jump up running the cooler
19:50
to get the temperature you requested but
19:52
I can't see anything anywhere that tells
19:54
you whether the cool is at the at the
19:56
temperature other than the power
19:57
consumption going down and there's no
19:59
temperature read out
20:03
here and just some yeah this is just
20:05
sort of very basic software I'm sure
20:07
there's like an API for more
20:09
sophisticated
20:10
applications but so if you know anything
20:12
more about these cameras please um stick
20:14
in the
20:15
comments um incidentally when I was
20:18
editing I didn't notice there was a bent
20:19
pin on the um CCD but I did fix that
20:22
before doing these tests and also
20:24
reactivated the um silica gel desicant
20:26
cuz I noticed it was uh I was getting
20:28
bit of condensation on the window but um
20:31
yeah so a couple of uh interesting but
20:34
maybe not particularly uh useful pieces
20:37
of Kit except for someone that's got a
20:38
very specific use so um I'll stick a
20:42
I'll stick these on eBay put a link in
20:44
the description and say hopefully
20:45
someone could actually make some uh good
20:47
use of these things
Example Output:
**Abstract:**
This video presents Part 2 of a teardown focusing on the optical components of a Fluidigm Polaris biotechnology instrument, specifically the multi-wavelength illuminator and the high-resolution CCD camera.
The Lumen Dynamics illuminator unit is examined in detail, revealing its construction using multiple high-power LEDs (430nm, 475nm, 520nm, 575nm, 630nm) combined via dichroic mirrors and filters. A square fiber optic rod is used to homogenize the light. A notable finding is the use of a phosphor-converted white LED filtered to achieve the 575nm output. The unit features simple TTL activation for each color, conduction cooling, and internal homogenization optics. Analysis of its EEPROM suggests extremely low operational runtime.
The camera module teardown showcases a 50 Megapixel ON Semiconductor KAF-50100 CCD sensor with micro-lenses, cooled by a multi-stage Peltier stack. The control electronics include an FPGA and a USB interface. Significant post-manufacturing modifications ("bodges") are observed on the camera's circuit boards. Basic functional testing using vendor software and a pinhole lens confirms image capture but reveals prominent vertical streaking artifacts, the cause of which remains uncertain (potential overload, readout artifact, or fault).
**Exploring the Fluidigm Polaris: A Detailed Look at its High-End Optics and Camera System**
* **0:00 High-End Optics:** The system utilizes heavy, high-quality lenses and mirrors for precise imaging, weighing around 4 kilos each.
* **0:49 Narrow Band Filters:** A filter wheel with five narrow band filters (488, 525, 570, 630, and 700 nm) ensures accurate fluorescence detection and rejection of excitation light.
* **2:01 Customizable Illumination:** The Lumen Dynamics light source offers five individually controllable LED wavelengths (430, 475, 520, 575, 630 nm) with varying power outputs. The 575nm yellow LED is uniquely achieved using a white LED with filtering.
* **3:45 TTL Control:** The light source is controlled via a simple TTL interface, enabling easy on/off switching for each LED color.
* **12:55 Sophisticated Camera:** The system includes a 50-megapixel ON Semiconductor KAF-50100 CCD camera (originally a Truesense/Kodak design) with a Peltier cooling system for reduced noise.
* **14:54 High-Speed Data Transfer:** The camera features dual analog-to-digital converters to manage the high data throughput of the 50-megapixel sensor, which is effectively two 25-megapixel sensors operating in parallel.
* **18:11 Possible Issues:** The video creator noted some potential issues with the camera, including image smearing.
* **18:11 Limited Dynamic Range:** The camera's sensor has a limited dynamic range, making it potentially challenging to capture scenes with a wide range of brightness levels.
* **11:45 Low Runtime:** Internal data suggests the system has seen minimal usage, with only 20 minutes of recorded runtime for the green LED.
* **20:38 Availability on eBay:** Both the illuminator and camera are expected to be listed for sale on eBay.
Here is the real transcript. What would be a good group of people to review this topic? Please provide a summary like they would:
The Future of AI Inference
Hao Zhang, Assistant Professor, University of California San Diego
This talk surveys the evolution of large-scale inference, showing how disaggregated prefill/decode serving boosts goodput and scales across GPU clusters. It also previews trends like attention–FFN disaggregation and video diffusion workloads, highlighting next-gen system solutions.
Events & Trainings: Other | Date: January 2026 | Industry: All Industries | Topic: Developer Tools & Techniques / AI Inference | Level: Technical – Advanced | Language: English
00:00 Hi everyone. It's a great pleasure to be here and a great pleasure to have the floor to give the last talk for this exciting NVIDIA Dynamo Day event.
00:10 And my name is Hao and I'm currently a faculty at UC San Diego and I work a lot in inference.
00:17 In the past I have been working on vLLM and DistServe, and now I also work closely with Dynamo and a few other inference engine developers to push inference further.
00:27 So today's talk, I have two parts. So in the first part, I want to do a recap of the main development trend in 2025 with model inference, which is essentially prefill-decode disaggregation.
00:41 I think we put a lot of effort into making this production ready and I really appreciate NVIDIA's effort in putting prefill-decode disaggregation into Dynamo and making it production ready.
00:55 And then in the second half, I will look ahead to talk about the two major trends. There are many, many trends, but I want to focus on two in AI inference.
01:04 Okay, let's get started with the first part. So we all know that the language model was the biggest thing in 2025, right?
01:11 And it has enabled so many new and interesting applications, and now language model based applications such as coding agents are still the biggest thing in 2026.
01:20 And the different companies, they compete to build different types of agents on top of those language models, and NVIDIA GPUs are one of the most important resources to make sure language models run.
01:32 But we know GPUs are expensive. So basically, in order to optimize language model inference, we are facing two fundamental problems.
01:41 The first is throughput. So how many requests per second we can handle given a fixed number of GPUs. And second is latency, of course, and how long users wait for the first response or for the subsequent tokens.
01:54 So therefore, all the language model endpoint providers and inference developers, they rush to achieve high throughput while also trying to keep latency low.
02:03 So at the first glance, if you have a lower latency, you should have a higher throughput because each request stays shorter in your GPU, right?
02:11 But in the language model world, things are more complex, and I think this is quite well understood today: a system can achieve high throughput by batching requests, but still deliver poor user experience and high latency.
02:24 So let's say an example. So basically, in language model, there are two most important service level objectives, right? TTFT and TPOT. I think in the previous sessions, I think a lot of people have been mentioning these two terms.
02:38 Let me use examples to recap. So basically, a chatbot needs to have relatively fast initial response, right? TTFT, time to first token, in order to make the user experience better.
02:51 But after that, the following token generation doesn't have to be faster than the average human reading speed. So this means that the TPOT, which stands for time per output token, after the initial token generation doesn't have to be too fast to make a good user experience.
03:07 But in some other application, for example, in a summarization application, we can tolerate a longer initial response, but the subsequent generation should be fast because we want to see the summary as soon as possible, okay?
03:20 So basically, what we want is a system that can deliver high throughput while making sure each request stays within these two SLOs, service level objectives. And we call this portion of the throughput the goodput.
03:31 And you can imagine a system processing 10 requests per second, but only three of them finish before their deadline. And that basically yields a goodput of three requests per second.
03:43 So basically, this means that high-throughput systems that do not meet this latency budget might provide poor user experience and waste capacity, right?
03:53 So basically, we want to have a high goodput while also keeping the latency low. And this actually leads to the major trend of 2025, what we did and what we upgraded in AI inference serving.
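A minimal sketch of the goodput metric just described, counting only requests that meet both latency SLOs; the field names and thresholds are illustrative, not from the talk.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft: float   # time to first token, in seconds
    tpot: float   # average time per output token, in seconds

def goodput(finished: list[Request], window_s: float,
            ttft_slo: float = 0.5, tpot_slo: float = 0.05) -> float:
    # Count only requests that met BOTH latency SLOs, then normalize by the window.
    ok = sum(1 for r in finished if r.ttft <= ttft_slo and r.tpot <= tpot_slo)
    return ok / window_s

# The talk's example: 10 requests finish within one second, but only 3 meet their deadlines.
reqs = [Request(0.3, 0.04)] * 3 + [Request(1.2, 0.09)] * 7
print(goodput(reqs, window_s=1.0))   # -> 3.0 requests per second of goodput
```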
04:06 So basically, this is disaggregated prefill and decoding. And it is a very straightforward idea, which basically most people can understand in one minute from this animation. So the request first goes into the prefill worker to finish the prefill and then migrates into a decode worker.
04:20 And it starts decoding until it finishes the generation and then exits the system. Yet it is also a very profound idea that has motivated a lot of more exciting research and development in 2025.
04:31 Next, I'm going to briefly do a recap on why disaggregated serving would yield significant improvement and then review some of the milestones we have achieved in 2025 in this direction.
04:43 Prior to disaggregation, the dominant technique was continuous batching, which essentially is that we collect as many requests as possible and run one iteration of the model for all of them together.
04:58 And this improves throughput, but it forces a single GPU to handle both the prompt, which is the prefill phase, and the subsequent iterative decoding phase for all requests, right?
05:11 So because continuous batching co-locates prefill and decode, the two phases actually interfere with each other, and I want to use this example to remind you of this interference.
05:23 So on the left, we batch two requests, R1 and R2, together in one GPU, and we can see that in the right window, when R2 arrives, R1's decode time, which is in blue, gets significantly increased because of R2's prefill.
05:38 And similarly, R2's prefill time, which is in green, also gets slightly increased.
05:43 So this is even worse when we have a system that is receiving an incoming stream of many requests, right?
05:48 Because each new request might interfere with all previous requests, okay?
05:51 But on the right-hand side, if we separate this to R1 and R2 into two different GPUs, right?
06:00 They actually don't have any interference, and this is the one primary gain that we have by doing disaggregated prefill and decode.
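A toy cost model of the interference just described; the per-iteration times below are invented purely for illustration.

```python
# Invented per-iteration costs, purely for illustration.
DECODE_STEP_MS = 10.0    # one decode iteration for a resident request, running alone
PREFILL_MS = 120.0       # prefill cost of a newly arrived long prompt

def decode_step_colocated(prefill_in_batch: bool) -> float:
    # Under continuous batching the decode step shares the iteration with the
    # compute-bound prefill, so its latency spikes when a new prompt arrives.
    return DECODE_STEP_MS + (PREFILL_MS if prefill_in_batch else 0.0)

def decode_step_disaggregated() -> float:
    # With disaggregation the prefill runs on a separate worker, so the decode
    # iteration time is unaffected by new arrivals.
    return DECODE_STEP_MS

print(decode_step_colocated(prefill_in_batch=True))   # 130.0 ms: the TPOT spike
print(decode_step_disaggregated())                    # 10.0 ms: no interference
```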
06:08 Another result of this interference: when services must satisfy both TTFT and TPOT SLOs, the system has to provision more resources to meet the latency goal, to basically compensate for this kind of interference, right?
06:22 Especially when either SLO is strict.
06:24 So in this example shown on this slide, in order to meet this latency goal, which was illustrated as this red dotted line, we have to increase the number of GPUs from one to four.
06:35 And this last system barely satisfies the latency requirement.
06:39 And there's one more problem with co-location.
06:44 That is, the parallelism strategies, for example tensor, pipeline, and data parallelism, we used are inherently coupled for the prefill and decoding computation.
06:53 Remember that prefill and decode have distinct computing patterns.
06:58 And because of this, the optimal parallelism strategy for the prefill and decoding phases might be different.
07:05 For example, when the TTFT is tight and TPOT is loose, the prefill phase prefers tensor parallelism to meet the tight latency target, whereas the decoding phase probably prefers data or pipeline parallelism to boost throughput.
07:19 And if we co-locate prefill and decode, we don't have a choice to configure the parallelism strategy separately for them.
07:26 OK, yeah, that's a pretty quick recap of the underlying rationale of why we gradually migrated all the inference engines from a co-located engine into a disaggregated engine.
07:37 And disaggregated prefill and decoding basically unlocks a natural divide and conquer optimization.
07:44 Because the two phases have different performance characteristics, we can choose the most suitable parallelism and resource allocation for each.
07:51 And the prefill instance is optimized for TTFT, often using tensor and data parallelism to maximize throughput on long prompts.
07:58 And the decoding instance is optimized for TPOT.
08:01 We can use wider batch sizes or expert parallelism like those in MoE to maximize the tokens per second.
08:08 And also prefill-decode interference is immediately eliminated, right?
08:11 Because now we put them on different GPUs.
08:13 There's no interference anymore.
08:14 And we can treat the SLO satisfaction problem as two independent optimizations.
08:18 And the result is a higher so-called goodput at lower cost.
08:23 Let's see a very concrete example why this simple strategy would give us improvement.
08:29 And this is a real profile we did one year ago when we started working on that very early prototype of disaggregation.
08:36 So in this figure, in the X-axis we are increasing the request rate gradually,
08:41 while on the Y-axis we show the P90 TTFT and the P90 TPOT.
08:46 The reason this curve is upward is because when we have more and more requests coming,
08:49 we still have the same computational resources, right?
08:52 So each request will observe a slightly higher latency.
08:56 And we can see from the upper figure that this system can support roughly three requests per second
09:02 that will stay within this bar that is the TTFT latency constraint using one GPU, okay?
09:08 And for the TPOT, you can sustain roughly 1.6 RPS.
09:13 So therefore the overall system throughput is just the minimum of this two, right?
09:20 Which is 1.6 requests per second.
09:22 But if we disaggregate, right?
09:25 It seems to get significantly improved.
09:27 So first, because we are now free from that so-called interference between prefill and decode,
09:32 we can observe that both the green curve and the blue curve,
09:38 they somehow get a little bit lower, right?
09:41 And this is because when we disaggregate these two phases,
09:43 we eliminate the interference.
09:44 And now each GPU only needs to handle one type of the task.
09:48 And as you can see, the prefill can achieve 5.6 requests per second,
09:54 and the decode can achieve 10 requests per second.
09:57 And more importantly now, we can flexibly allocate different GPUs for prefill and decode.
10:03 This is so-called independent resource allocation for prefill and decoding,
10:08 which is enabled by this technique.
10:10 So here we can simply allocate two prefill workers to pair with one decode worker,
10:15 which we call 2P1D, right?
10:17 And with these three GPUs in total, we can get roughly 10 requests per second for the whole system.
10:25 And if we divide it by the number of GPUs in the whole system, three GPUs,
10:28 we get a 3.3 RPS per GPU goodput,
10:31 which is roughly twice of the goodput of the left-hand side,
10:34 which is a co-located system.
10:36 Okay, I hope this example and profile give you an impression on a high level rationale
10:42 why we want to do disaggregated serving, okay?
10:47 And we can actually make this way more complex and more optimized
10:51 by, for example, optimizing different parallelism strategies
10:54 for prefill and decoding workers, optimizing the resource allocation,
10:57 optimizing the parallelism plans, and also optimizing how we transfer the KV cache.
11:01 But in a very, very generalized form, basically our core problem can be summarized
11:09 using a very, very simple equation, which I put on this slide.
11:12 We aim to solve this equation, which is XPYD, right?
11:17 Where P stands for prefill and D stands for decoding,
11:20 and XY is basically our allocation.
11:22 So this simple equation actually is very, very profound
11:24 because it maps to many, many subproblems, okay?
11:28 The two most important problem I put on this slide,
11:31 the first problem is placement.
11:33 So we need to decide how many prefill and decoding GPUs
11:36 to allocate and where to place it.
11:39 And we must meet workload requirements and maximize the GPU goodput.
11:42 Second, of course, communication.
11:44 Because now we have a disaggregated system.
11:46 We have to transfer the KV cache between P and D.
11:49 So transferring the KV cache from P to D must be fast and efficient,
11:53 and minimizing this KV cache transfer time across, for example,
11:56 heterogeneous hardware resources becomes critical at scale.
12:00 And this question opens a rich space of research and development
12:04 across scheduling, communication, storage, and parallelism plans.
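A brute-force sketch of the placement half of this xP/yD problem, using the per-worker rates profiled earlier in the talk; this is not the actual DistServe or Dynamo planner, just an illustration of the search.

```python
def best_xpyd(total_gpus: int, prefill_rps: float, decode_rps: float):
    # Try every split of the GPU budget into x prefill and y decode workers and
    # keep the one with the highest goodput per GPU (the slower stage is the bottleneck).
    best = (0, 0, 0.0)
    for x in range(1, total_gpus):
        y = total_gpus - x
        system_rps = min(x * prefill_rps, y * decode_rps)
        per_gpu = system_rps / total_gpus
        if per_gpu > best[2]:
            best = (x, y, per_gpu)
    return best

# With the rates profiled above (5.6 rps per prefill GPU, 10 rps per decode GPU):
print(best_xpyd(3, prefill_rps=5.6, decode_rps=10.0))   # -> (2, 1, 3.33...), i.e. 2P1D
```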
12:08 So what came out initially is our first research prototype, DistServe.
12:16 And here is the initial simple architecture of DistServe.
12:19 We treat prefill workers and decode workers as first-class services
12:23 linked by a KV cache transfer engine.
12:27 And this engine basically directs requests and key-value caches
12:31 between these two types of instances, prefill and decoding instances.
12:35 And we also have some algorithms to calculate
12:38 how many GPUs we should allocate for prefill,
12:40 how many for decoding, and calculate the optimal parallelism strategy.
12:44 OK, and this architecture actually allows this DistServe system
12:48 to run across many, many GPUs with high efficiency.
12:51 And it also inspired production systems later on.
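To see why the KV cache transfer engine matters, here is a rough size estimate for the cache that must move from a prefill to a decode worker. The model shape (80 layers, 8 KV heads of dimension 128, fp16) is an assumed 70B-class GQA configuration, not a model named in the talk.

```python
def kv_cache_bytes(seq_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; all other shape parameters are the assumed config above.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

prompt_tokens = 4096
print(f"{kv_cache_bytes(prompt_tokens) / 2**30:.2f} GiB per request")   # ~1.25 GiB
```

At that size, moving each request's cache over a slow path would dominate end-to-end latency, which is why low-latency transfer layers get so much attention in these systems.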
12:54 OK, and a fun bit of the history.
12:58 So basically we published an open-source DistServe
13:01 at the end of 2023 alongside a concurrent paper,
13:05 Splitwise, from Microsoft.
13:07 And throughout 2024, the adoption of this so-called
13:12 disaggregated prefill and decode was slow
13:14 because engineering teams were very busy integrating,
13:18 for example, PagedAttention and other batching optimizations.
13:22 And also because the benefit of disaggregation in 2024
13:26 wasn't very obvious yet.
13:29 Yet quietly, large companies like ByteDance, like Google,
13:32 they started replacing continuous batching with disaggregation.
13:35 And in early 2025, DeepSeek-V3 publicly embraced PD
13:41 disaggregation with different parallelism for each phase.
13:44 And later in the year, I think NVIDIA highlighted
13:49 disaggregation and Dynamo at GTC 2025.
13:52 And by the end of 2025, almost all production systems
13:56 adopted this architecture.
13:59 And throughout the course of 2025,
14:03 we have witnessed a lot of very, very great development
14:06 that pushed disaggregated serving further.
14:09 Here, I wanted to highlight three very interesting ones.
14:13 LMCache, DeepSeek-V3, and of course, one of the best,
14:17 which is NVIDIA Dynamo.
14:18 And we actually wrote a review article last year
14:21 on my blog,
14:25 and feel free to check it out.
14:27 The title is Disaggregated Inference, 18 Months Later.
14:30 So LMCache shows that disaggregation is not just about compute, right?
14:36 Storage matters.
14:38 LMCache pipelines KV cache movement between prefill and decoding,
14:42 batching data movement operations
14:44 and decoupling cache storage from the inference engine
14:47 to improve scalability.
14:49 And this innovation makes it possible to support
14:51 so many GPUs efficiently.
14:53 And NVIDIA Dynamo goes further by combining disaggregation
14:57 with a GPU planner and a KV-cache-aware router
14:59 that uses a low-latency transfer layer.
15:02 And this design also delivers state-of-the-art results
15:05 on the latest GB200 NVL72 systems.
15:08 And DeepSeek-V3 actually is a great, great example of
15:13 how disaggregation enables specialized parallelism.
15:16 And during prefill,
15:17 it uses moderate tensor and data parallelism
15:21 to process large prompts efficiently.
15:23 And during decode,
15:24 it employs very wide expert parallelism, 256 or even more experts,
15:29 with high degrees of data parallelism
15:31 to maximize GEMM utilization.
15:35 And the system also uses a specialized storage backend, 3FS,
15:38 to aggregate the throughput of thousands of SSDs
15:41 and network bandwidth for KV cache transfer.
15:44 And this innovation shows how the disaggregation principle
15:47 can be extended to MoE models and to long-context inference.
15:51 And the NVIDIA Dynamo is perhaps the most advanced
15:55 and the most complete production implementation
15:57 of prefill-decode disaggregation.
16:00 And it treats prefill and decode workers
16:02 as separate microservices connected
16:04 via this so-called KV-aware router,
16:06 and the GPU planner profiles
16:08 GPUs and auto-scales resources.
16:10 And NVIDIA also develops NIXL,
16:12 which unifies NVLink, InfiniBand,
16:14 and the PCIe and SSD fabrics
16:21 under a single low-latency abstraction.
16:24 And Dynamo also supports popular inference engines
16:27 like TensorRT-LLM, vLLM, and SGLang,
16:29 and has demonstrated state-of-the-art performance
16:33 on the latest generation hardware.
16:35 And it shows that disaggregation is ready for production
16:38 at the data center scale.
16:39 Okay, that pretty much wraps my first part
16:43 which talked about the progress we have made in 2025
16:46 in pushing disaggregated inference into production.
16:49 In my second part, I want to basically give you
16:52 an outlook on what's going on now.
16:54 For example, what my team is working on
16:56 and what the community is working on.
16:57 So there are many trends, right?
17:01 In language model inference, in AI inference,
17:03 but here I want to really highlight two.
17:06 So looking ahead into 2026,
17:08 we see two major trends in AI inference.
17:11 So first, this disaggregation principle
17:14 is moving deeper into the model, right?
17:15 And researchers are splitting attention
17:18 and also the feed-forward or MOE layers
17:21 within each transformer layer
17:23 and running them independently on different sets
17:25 or even different types of GPUs.
17:27 So second, many of you probably already heard about this, right?
17:31 DiT models, particularly for video generation,
17:33 are gaining more and more traction.
17:36 And the DiT model and the language model,
17:38 their boundary is kind of blurred
17:40 because the latest generation models,
17:42 they actually have both components, right?
17:44 And their unique compute patterns
17:47 and massive sequence lengths
17:49 require rethinking our inference systems.
17:51 And in the next few slides,
17:54 I will unpack each trend
17:55 and try to explain why they matter.
17:57 So this attention-FFN disaggregation,
18:02 or in short AFD, takes the idea
18:05 of splitting workloads one step further.
18:08 So instead of treating each transformer layer
18:11 as a monolithic block,
18:12 this AFD basically decouples the attention module
18:15 from the feed-forward or MoE module.
18:17 And this attention module is replicated,
18:20 for example, using data parallelism,
18:22 while this FFN or MoE is scaled
18:25 with another, different form of parallelism.
18:27 I think it basically shares the same rationale
18:29 as we had for prefill and decode disaggregation, right?
18:31 Because we indeed want to allocate
18:33 a different amount of resources
18:34 and we want to adopt different types
18:37 of parallelism plans for these two modules.
18:40 The reason this will be beneficial is because
18:42 the attention and the expert or FFN modules
18:44 basically have very different
18:46 computational profiles, right?
18:47 And this independent scaling allows each module
18:51 to maximize GPU utilization.
18:53 For example, attention replicas
18:54 can accumulate larger batches
18:56 while FFN experts can run on smaller batches
18:58 more efficiently.
18:59 And one very early prototype
19:03 is called MegaScale-Infer.
19:04 I think it's a system developed by ByteDance,
19:06 which I post the link below.
19:08 It introduces this idea for MoE models
19:11 and they claim to achieve up to 1.9x
19:15 higher per-GPU throughput
19:17 than co-located solutions.
19:21 Here I want to use this slide
19:23 to give you one key insight
19:25 behind this AFD.
19:26 So at first glance,
19:27 especially when you hear about AFD
19:29 for the first time,
19:30 it looks very scary, right?
19:32 Because I think PD disaggregation
19:34 already sounds scary,
19:35 because we put the two phases
19:37 into two independent pools of resources
19:40 and we basically incur
19:42 some additional communication.
19:44 But I think during the development of
19:46 DistServe,
19:46 we have shown that this communication
19:48 is definitely addressable, right?
19:49 We can hide this communication
19:50 using many, many different measures.
19:52 But this AFD actually looks even scarier.
19:54 Why?
19:54 Because we are adding per-layer
19:56 communication between attention and FFN.
19:58 Remember, in PD disaggregation
20:00 the communication is only one shot.
20:02 It's between prefill and decode.
20:04 But here we are adding per-layer communication,
20:06 which is very scary.
20:07 And it sounds like
20:09 it will kill the latency.
20:09 But the key insight here is that in MoE
20:14 we already pay for an all-to-all
20:17 in each layer.
20:18 That's because we introduced
20:20 expert parallelism.
20:21 Because tokens must be redistributed
20:23 when switching from
20:24 data parallelism in attention
20:26 to expert parallelism in MoE.
20:28 So the key insight here
20:29 is that this AFD communication,
20:31 which is per layer,
20:32 can be fused with
20:35 that existing MoE all-to-all.
20:37 So basically,
20:39 we move activations
20:40 and we do expert routing
20:41 in one fused communication collective.
20:43 So as long as
20:44 the fused communication
20:45 is not slower
20:46 than the original all-to-all,
20:47 then AFD is basically free, right?
20:50 From a latency perspective.
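A back-of-the-envelope sketch of that latency argument, with made-up numbers purely for illustration:

```python
# Toy accounting: per-layer AFD traffic can ride on the all-to-all that expert
# parallelism already pays for, so it adds nothing to the critical path as long
# as the fused collective is not slower than the original one.
def per_layer_latency_ms(compute_ms: float, comm_ms: float) -> float:
    # Communication on the critical path simply adds to compute in this toy model.
    return compute_ms + comm_ms

moe_all_to_all_ms = 0.40   # dispatch/combine cost expert parallelism already pays per layer
afd_transfer_ms = 0.35     # attention->FFN activation move, fused into that same collective

baseline = per_layer_latency_ms(1.2, moe_all_to_all_ms)
with_afd = per_layer_latency_ms(1.2, max(moe_all_to_all_ms, afd_transfer_ms))
print(baseline, with_afd)  # identical whenever the fused collective is not slower
```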
20:51 That's why I think AFD
20:53 definitely will take off,
20:54 especially in the context
20:56 of MoE models,
20:57 because indeed
20:58 we can find ways
21:01 to hide the communication
21:02 while still enjoying
21:03 the benefits of AFD.
21:04 And once this is true,
21:07 we get all the upside, right?
21:08 For example,
21:09 we have resource decoupling.
21:10 We have independent scaling
21:12 of attention and FFN
21:14 and we have the ability
21:15 to even use heterogeneous
21:16 hardware for attention
21:18 and for MoE or FFNs, okay?
21:22 And running attention
21:24 and FFN on different GPUs
21:26 also probably introduces
21:27 some idle periods
21:29 unless we carefully orchestrate them.
21:31 And there's a very innovative idea
21:34 that we can address
21:35 or hide this communication
21:37 overhead with this
21:39 ping-pong pipeline
21:41 parallelism.
21:42 So it basically partitions
21:43 the batch
21:44 into many micro-batches
21:45 and shuffles them back and forth
21:47 between attention and FFN,
21:49 overlapping communication
21:50 with computation.
21:51 So when attention finishes
21:52 processing one micro-batch,
21:54 it passes the hidden state
21:56 to an FFN or MoE GPU
21:58 and starts on the next micro-batch.
22:00 And meanwhile,
22:01 the FFN GPU processes
22:04 the previous micro-batch
22:04 and returns the output
22:06 for the next layer's attention.
22:09 And we can basically schedule
22:10 these micro-batches'
22:11 computation and communication
22:13 on different GPU streams
22:14 to form this kind of
22:15 so-called ping-pong pipeline parallelism.
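Here is a toy, framework-free sketch of such a ping-pong schedule with two micro-batches; the function name and step granularity are illustrative, not a real kernel-level implementation:

```python
# Toy timeline of the ping-pong idea: with two micro-batches, the attention
# pool and the FFN/MoE pool are both busy at every step after warm-up, so each
# hand-off can overlap with the other pool's compute.
NUM_LAYERS = 3

def ping_pong_timeline(num_layers: int) -> list[dict]:
    # Order in which the attention pool picks up work: alternate micro-batches per layer.
    work = []
    for layer in range(num_layers):
        work += [("mb0", layer), ("mb1", layer)]

    steps = [{"attention": work[0], "ffn": None}]          # warm-up: FFN still idle
    for i in range(1, len(work)):
        # Attention starts the next item while FFN consumes what attention just produced.
        steps.append({"attention": work[i], "ffn": work[i - 1]})
    steps.append({"attention": None, "ffn": work[-1]})      # drain the last micro-batch
    return steps

for t, step in enumerate(ping_pong_timeline(NUM_LAYERS)):
    print(t, step)
```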
22:17 And this actually requires
22:18 a lot of deep diving
22:19 into lower-level
22:21 GPU programming,
22:22 which I think is really,
22:23 really nice today,
22:24 because people seem to be
22:25 very open to this nowadays
22:26 compared to, let's say,
22:27 two or three years ago.
22:28 And this is a very powerful example
22:31 of how system design
22:33 and the model architecture
22:34 must co-evolve, right?
22:36 OK, that is basically trend one.
22:39 I think AFD is really
22:40 something interesting.
22:41 And in my lab,
22:42 we are also doing
22:43 some sort of AFD,
22:44 but we also apply this AFD idea
22:46 in training
22:47 to improve the training infrastructure.
22:49 OK, so far
22:51 I have talked about
22:52 language model inference,
22:53 and apparently
22:54 the next big shift
22:55 is video diffusion.
22:56 So on the left,
22:57 you can see that
22:58 almost every major AI lab
23:00 now has a competitive
23:01 video model
23:02 and the quality gap
23:03 is shrinking fast.
23:05 And I want you to pay attention
23:06 to this column,
23:07 API pricing.
23:08 And I think this price
23:11 is a little bit crazy
23:12 if you compare it
23:13 to language model inference, right?
23:14 I think that today
23:14 language model inference
23:16 tokens are almost free,
23:16 in my opinion.
23:18 But if you look at
23:19 this video generation cost,
23:20 for every one-minute video
23:22 that is generated,
23:23 you need to pay roughly
23:24 $10 on average.
23:25 OK?
23:26 On the right,
23:27 the market growth
23:28 tells the same story.
23:30 So video generation demand
23:31 is ramping up aggressively,
23:33 and from a systems perspective,
23:35 this is a big deal, right?
23:36 Because video diffusion
23:37 is much more demanding than text.
23:39 It has longer sequences,
23:40 it has more attention,
23:41 and it has far higher
23:43 compute requirements.
23:44 And this basically makes
23:45 video diffusion
23:46 a perfect stress test
23:47 for everything we just discussed.
23:48 For example,
23:49 kernel optimization,
23:50 compilers,
23:51 disaggregation,
23:52 whatever, right?
23:53 And let me walk through
23:56 a very, very quick example
23:57 to highlight the problem.
23:58 I will focus on the problem
23:59 but not the solutions.
24:00 OK?
24:01 So let me walk through
24:03 the video DiT architecture
24:05 and explain why inference
24:06 is slow.
24:07 So here, by the way, this is a typical
24:16 video DiT architecture,
24:17 and this represents
24:19 Wan,
24:20 which is a very, very popular
24:21 open-source model.
24:22 On the left,
24:23 the input video
24:24 is first passed
24:25 through a VAE encoder,
24:26 which compresses the video
24:27 into a long latent sequence.
24:29 The latent sequence
24:30 then goes into the core
24:31 diffusion model,
24:32 which is here,
24:33 which is a stack
24:34 of N DiT blocks.
24:35 These are essentially
24:36 transformer
24:39 layers,
24:40 but conditioned
24:41 on text
24:42 through something like
24:43 a text input
24:43 and of course attention.
24:44 The key difference
24:45 from a language model
24:46 is that this is
24:47 a diffusion process.
24:49 We don't just run
24:49 the model once.
24:51 We run the same DiT stack
24:52 many, many times
24:52 across diffusion time steps,
24:55 which immediately
24:56 multiplies the cost.
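A schematic, runnable skeleton of that loop (all functions below are stubs, not a real model or any library's API) makes the cost multiplication explicit:

```python
# Skeleton of video DiT inference: the whole DiT stack runs once per diffusion
# step, so its cost is multiplied by the step count, with the VAE as an extra
# stage at each end.
def vae_encode(video):
    return video               # stub: compress pixels into a latent sequence

def dit_block(hidden, text_emb, i):
    return hidden              # stub: one transformer (DiT) layer, text-conditioned

def scheduler_step(latent, hidden, t):
    return latent              # stub: one denoising update

def vae_decode(latent):
    return latent              # stub: heavy, non-transformer decode kernels

def generate(video_or_noise, text_emb, num_steps=50, num_blocks=40):
    latent = vae_encode(video_or_noise)
    dit_calls = 0
    for t in range(num_steps):                      # diffusion loop: full stack every step
        hidden = latent
        for i in range(num_blocks):                 # stack of DiT blocks
            hidden = dit_block(hidden, text_emb, i)
            dit_calls += 1
        latent = scheduler_step(latent, hidden, t)
    return vae_decode(latent), dit_calls

_, calls = generate("noise", "a cat surfing")
print(calls)  # 50 steps x 40 blocks = 2000 transformer-block forwards for one clip
```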
24:57 Right?
24:58 And on top of that,
24:59 we also pay for the
25:01 VAE encoding
25:02 and decoding
25:03 at the beginning
25:04 and the end,
25:05 and this VAE encoder
25:06 and decoder
25:07 have very different
25:08 kernel profiles
25:09 compared to transformers.
25:10 So that poses another layer
25:12 of kernel optimization problems, right?
25:13 And it is extra overhead
25:15 outside of the transformer.
25:16 And finally,
25:17 the attention pattern
25:18 is also very special.
25:19 So one key message
25:20 I want to highlight
25:21 is that in video DiT
25:22 the attention
25:23 has extremely long sequences.
25:26 It has heavy self-attention
25:27 and also it has
25:28 cross-attention to text.
25:29 So in many ways
25:30 this already looks like
25:31 a language model inference
25:32 workload,
25:33 but with longer sequences
25:34 and repeated
25:35 dozens of times, right?
25:36 Which is why
25:37 video DiT is so expensive.
25:40 And here I want
25:41 to highlight one example,
25:42 and that is the attention,
25:44 because that's where
25:45 most of the cost
25:46 really comes from.
25:48 So let's look at
25:49 a real open-source example.
25:50 This is
25:51 HunyuanVideo.
25:52 It basically generates
25:53 a five-second video
25:54 at 720p
25:55 using a 14B DiT model.
25:57 13B, sorry.
25:59 Even after applying
26:01 the VAE,
26:02 the latent shape
26:03 is still
26:04 30 by 48 by 80.
26:06 Okay?
26:07 If you unroll this number,
26:08 it's basically a sequence
26:09 length of
26:11 115k tokens.
26:12 Okay?
26:13 And this is already,
26:14 this is only a five-second
26:15 video, by the way.
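The sequence-length arithmetic is easy to check:

```python
# Quick check of the number above: the 3D latent is flattened into one token
# sequence before attention.
frames, height, width = 30, 48, 80   # latent shape after the VAE, as on the slide
seq_len = frames * height * width
print(seq_len)  # 115200 tokens (~115k) for a single five-second 720p clip
```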
26:16 And now
26:17 here's the crazy part, right?
26:18 Generating
26:19 this five-second clip
26:20 takes roughly
26:21 16 minutes
26:22 on a single H100,
26:24 and it's basically
26:25 a real bottleneck
26:26 for practical use.
26:27 So what's going on here?
26:28 If you look at
26:29 the FLOPs breakdown
26:30 on the right,
26:31 basically attention
26:32 dominates the computation,
26:33 and over 80%
26:34 of the total time
26:35 is spent basically
26:36 on just doing attention.
26:37 That's because
26:38 attention grows
26:39 quadratically
26:40 with sequence length, right?
26:41 So in video DiT,
26:43 indeed, people use
26:44 a sort of
26:45 3D attention.
26:46 That is,
26:47 they unroll the
26:48 height, width, and time
26:49 into a long sequence
26:50 and they perform
26:51 quadratic
26:52 full attention across the
26:53 3D sequence.
26:54 And this is something
26:55 that really
26:56 makes it slow.
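A rough FLOPs estimate, using standard transformer cost formulas and an assumed hidden size (the 3,072 below is illustrative, not HunyuanVideo's published number), shows why the quadratic attention term dominates at this sequence length:

```python
# Rough per-layer FLOPs sketch (estimates, not measurements): attention scores
# scale with seq_len**2, while projections and the MLP scale linearly in seq_len.
seq_len = 115_200          # flattened 3D latent from the previous slide
d_model = 3_072            # assumed hidden size, for illustration only
mlp_ratio = 4

attn_scores_flops = 2 * 2 * seq_len * seq_len * d_model       # QK^T and attn @ V
proj_flops        = 2 * 4 * seq_len * d_model * d_model       # Q, K, V, O projections
mlp_flops         = 2 * 2 * seq_len * d_model * (mlp_ratio * d_model)

total = attn_scores_flops + proj_flops + mlp_flops
print(f"attention-score share: {attn_scores_flops / total:.0%}")  # ~86% with these assumptions
```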
26:57 OK, that is one problem,
26:59 and there are a few
27:00 other problems.
27:01 For example, I said
27:02 the diffusion process
27:03 needs to run
27:04 like 50 steps
27:06 or even 100 steps
27:07 sometimes.
27:08 And you also have
27:09 the problem of handling
27:10 this VAE
27:11 encoder and decoder,
27:12 which is arguably
27:13 also very heavy.
27:14 OK, here I want
27:16 to quickly highlight
27:17 the system we are building.
27:18 It's a new system
27:19 called FastVideo,
27:20 and our goal
27:21 is pretty simple.
27:22 We want to build
27:23 a simple and efficient
27:24 post-training
27:25 and inference engine
27:26 for video generation models.
27:27 And at this stage,
27:29 we are roughly optimizing
27:30 everything from
27:31 attention kernels
27:32 to the memory layout
27:33 to the system architecture
27:35 to make video diffusion
27:36 not just powerful
27:37 but also practical,
27:38 both for research
27:39 and real-world development.
27:40 And of course,
27:42 like I said,
27:43 I think video
27:44 diffusion is a very,
27:45 very exciting field.
27:46 And you probably
27:47 also heard
27:48 that we are collaborating
27:49 with vLLM,
27:50 Omni, and also
27:51 SGLang diffusion
27:53 to basically push
27:54 video diffusion
27:55 into the language model
27:57 inference engines.
28:00 And here's an example
28:01 of things we do.
28:02 We want to push for real-time
28:03 video generation
28:04 in 2026.
28:05 And we have done it
28:06 for 480p video
28:08 on a 1.3B model.
28:09 But we want to push it
28:10 further.
28:11 We want to do 1080p.
28:12 We want to do 4K.
28:13 And we want to do it
28:14 on much larger models.
28:15 OK,
28:17 that pretty much wraps up my talk,
28:19 and I think I highlighted
28:20 two trends.
28:21 One is AFD,
28:22 right?
28:23 Which is a very, very
28:25 promising technique
28:26 in MoE models,
28:27 given that MoE models
28:28 have become very dominant today.
28:29 The second is diffusion.
28:30 I think diffusion
28:31 and language models
28:32 now sound very separate,
28:34 but eventually they will
28:36 probably be merged into one model.
28:38 For example,
28:39 Nano Banana and this kind of architecture
28:41 actually prove that.
28:42 And there are a few other trends,
28:44 but I hope this is helpful
28:46 for you. Okay, thank you.