WF FPGA Ideas: Difference between revisions

From F256 Foenix

Latest revision as of 22:29, 5 January 2026

Video enhancements

WF16 Video Architecture

Math coprocessor

FPU Accumulator

SDCard

Auto-tx on read. Could supersede this with auto-reading a 16-bit length to a storage pointer, running in the background, flag or interrupt when done. Loading straight into DDR3 would be good once audio/video can read from there.

Stream an MP3 or MIDI file from disk (or DDR3?) straight to the chips.

Bootstrap and Machine Identity

Access to the core's SD card. Use better names than the current hardcoded ones.

Soft-boot into one of the other cores, instead of relying on jumpers to select the core. Have an extended PGZ header or new file format that can request such things. Note that PGZ should say how many and which 8k blocks or 64k banks it's hardcoded for. Should be another load format for "play nice" system tools that can be simultaneously loaded.

Identify cores by an 8 character ASCII string, instead of bit fields. Maybe have a version number (16 bit) per release of any named platform.
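A minimal sketch of what such an identity record could look like, assuming a 10-byte layout: 8 space-padded ASCII bytes followed by a little-endian 16-bit version. The layout, the padding, and the name `WF16` used in the usage note are placeholders, not an existing format.

```python
import struct

# Hypothetical layout: 8 ASCII bytes (space padded) + little-endian 16-bit version.
CORE_ID_FORMAT = "<8sH"

def pack_core_id(name: str, version: int) -> bytes:
    """Build the 10-byte identity record for a core."""
    raw = name.encode("ascii")
    if len(raw) > 8:
        raise ValueError("core name must fit in 8 ASCII characters")
    return struct.pack(CORE_ID_FORMAT, raw.ljust(8, b" "), version)

def unpack_core_id(blob: bytes) -> tuple[str, int]:
    """Recover (name, version) from a 10-byte identity record."""
    raw, version = struct.unpack(CORE_ID_FORMAT, blob)
    return raw.decode("ascii").rstrip(" "), version
```

For instance, `pack_core_id("WF16", 0x0102)` gives 10 bytes that software can compare as a string instead of decoding bit fields.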

Have different ROM loads for different cores? Or boot from RAM if some magic bytes occur there instead of ROM, though it would be the ROM boot code that does that check. What ISA would the code be?

Wavetable Audio

Some form of wavetable audio is sorely missing for Amiga, TG16, SNES, and Sound Blaster era audio, especially sound effects. Basic multiple channels of stereo or panned mono, uncompressed samples, with support for some simple compression formats.

Pull small buffers of audio into the FPGA from RAM, keeping 2 live at a time: one playing, one buffered. At 25 MHz with a 48 kHz sample rate, 1 sample lasts about 520 cycles (roughly 20.8 µs), which could be fast enough to fetch the next sample while the last one is playing and just single-buffer it? Or even use a rolling window. Should also support just-in-time software generation of each buffer, with IRQ notifications.
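The timing budget above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function names are made up for illustration, and the clock and sample rate are the ones quoted in the paragraph.

```python
CPU_HZ = 25_000_000   # clock quoted in the notes
SAMPLE_HZ = 48_000    # output sample rate quoted in the notes

def cycles_per_sample(cpu_hz: int, sample_hz: int) -> float:
    """Clock cycles available between two consecutive output samples."""
    return cpu_hz / sample_hz

def sample_period_us(sample_hz: int) -> float:
    """Duration of one sample in microseconds."""
    return 1_000_000 / sample_hz

def buffer_duration_ms(n_samples: int, sample_hz: int) -> float:
    """How long a buffer of n_samples lasts, in milliseconds."""
    return 1000 * n_samples / sample_hz
```

With these numbers there are a little over 520 cycles per sample, so whether single-buffering works mostly depends on how many of those cycles a RAM fetch can take in the worst case.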

Stretch samples to whatever the output sampling rate is. Linear interpolation would be neat, but probably optional. Same with the decay-to-zero that many old synth chips had.
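As a rough software model of the stretching step, here is linear interpolation between neighbouring source samples. It is a sketch only: real hardware would more likely use a fixed-point phase accumulator, and the function name and float arithmetic are assumptions made for readability.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Stretch `samples` from src_rate to dst_rate using linear interpolation."""
    if not samples:
        return []
    step = src_rate / dst_rate                    # source samples per output sample
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * step
        idx = min(int(pos), len(samples) - 1)     # guard against float rounding
        frac = pos - idx
        a = samples[idx]
        # Hold the last sample when there is no right-hand neighbour.
        b = samples[idx + 1] if idx + 1 < len(samples) else a
        out.append(a + (b - a) * frac)
    return out
```

Dropping the interpolation (always taking `a`) gives the cheaper nearest-sample behaviour, which is why it could reasonably be optional.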

Sound Chip Instances

Instead of duplicating hardware instances of sound chips, multiplex all their registers and internal variables. An access would select 1 of them, and the hardware would run with that multiplexer active on all its values. Need to compare the size taken by each approach, which depends on how complex the chip is.

Since sound chips don't need to be that fast, serially looping through the instances, executing a cycle of each and accumulating their output into the final audio sample, would be fine.

Register to select which instance(s) to use, so programs can be agnostic to what sound context exists with other things. Probably expose 2 chips at a time through the IO registers, with a separate one to select which bank of 2 to use. At least for mono chips, or those which are commonly in pairs. Stereo chips could just have 1 register set exposed.
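The multiplexing idea can be modelled in software as one shared step function run over several saved contexts. The sketch below uses a toy square-wave "chip" purely as a stand-in; `ChipState`, `step_chip`, and `mix_sample` are hypothetical names and do not correspond to any real chip core.

```python
from dataclasses import dataclass

@dataclass
class ChipState:
    """Registers and internal variables of one virtual chip (toy square wave)."""
    period: int        # half-period, in ticks
    volume: int
    counter: int = 0
    level: int = 1     # current output polarity, +1 or -1

def step_chip(state: ChipState) -> int:
    """The single shared logic block: advance one context by one tick."""
    state.counter += 1
    if state.counter >= state.period:
        state.counter = 0
        state.level = -state.level
    return state.level * state.volume

def mix_sample(contexts: list[ChipState]) -> int:
    """Serially run every context through the shared logic and sum the outputs."""
    return sum(step_chip(ctx) for ctx in contexts)
```

Only the state is duplicated per instance here, which is the trade being weighed against instantiating the whole chip several times.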

FPGA-based CPU

65816, but with a genuine 16-bit data bus.

Bitstream readers/writers

Write a byte or word to an FPGA location; it takes a CPU cycle to write it, and bumps its pointer.

For a bit stream, a 32-bit bit pointer covers 4 Gbit = 512 MB. A write would need to know the width to write. Maybe 16 registers: write a value to one of those to declare how many bits of the written value to write. This actually allows the index register to determine the width dynamically, which is nice. Both read & write interfaces should use this. A complete hack, but easier to use, would be to have 18 regs. For any width >8 bits, a 2nd access to the next register would grab the high byte, even though it's technically the trigger for the next higher width. But this would get confusing in 65816 16-bit mode when accessing lower lengths.

Separate read & write contexts, so copies, decompression, etc. can be done. Bit pointers can be directly read/written as well. Pointers always advance in the positive direction, though, at least for now.

Probably good to support 0-length, for dynamically computed lengths. So with 0-16 supported, that's 17 entry points, kinda messy.

Also, skip forward N bits without a read or write. Technically this could just be a read N bits and ignore the value, but this should be 0-65535 bits skipped.

8-bit interface: 2 bit pointers, then 8 byte locs for pointer 0, and 8 byte locs for pointer 1.

16-bit interface: 2 bit pointers, then 16 word locs for pointer 0, and 16 word locs for pointer 1. Writes are triggered on the high byte write. Reads trigger on the low byte read, which readies the high byte.

This is a CPU-blocking interface for reads, buffered for writes.
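A small software model of the reader/writer may help pin down the semantics. It assumes LSB-first bit order within each byte, widths of 0 to 16, and a skip that acts on the read pointer; none of these are settled in the notes, and the class and method names are illustrative only.

```python
class BitStream:
    """Model of the proposed bitstream unit with separate read/write bit pointers."""

    PTR_MASK = 0xFFFFFFFF  # 32-bit bit pointers, as in the notes

    def __init__(self, size_bytes: int):
        self.mem = bytearray(size_bytes)
        self.read_ptr = 0
        self.write_ptr = 0

    def write(self, width: int, value: int) -> None:
        """Like storing `value` to width register `width`: append its low bits."""
        if not 0 <= width <= 16:
            raise ValueError("width must be 0..16")
        for i in range(width):
            byte, off = divmod(self.write_ptr, 8)
            if (value >> i) & 1:
                self.mem[byte] |= 1 << off
            else:
                self.mem[byte] &= ~(1 << off) & 0xFF
            self.write_ptr = (self.write_ptr + 1) & self.PTR_MASK

    def read(self, width: int) -> int:
        """Like loading from width register `width`: fetch the next bits."""
        if not 0 <= width <= 16:
            raise ValueError("width must be 0..16")
        value = 0
        for i in range(width):
            byte, off = divmod(self.read_ptr, 8)
            value |= ((self.mem[byte] >> off) & 1) << i
            self.read_ptr = (self.read_ptr + 1) & self.PTR_MASK
        return value

    def skip(self, nbits: int) -> None:
        """Advance the read pointer by 0..65535 bits without touching memory."""
        if not 0 <= nbits <= 0xFFFF:
            raise ValueError("skip must be 0..65535 bits")
        self.read_ptr = (self.read_ptr + nbits) & self.PTR_MASK
```

A zero-width access is a no-op in this model, which is the behaviour that makes dynamically computed lengths convenient.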

RLE Format(s)

RLE layers, DMA, and potentially sprites can use RLE encoding.

Span-based RLE formats

bpp  Layout                       Length  Max compression      Breakeven
1    clllllll                     1-128   16:1 byte            8 pixels
2    ccllllll                     1-64    16:1 byte            4 pixels
4    ccccllll                     1-16    4:1 byte             2 pixels
4    ccccCCCC llllllll llllllll   1-256   170:1 byte (512:3)   3+3 pixels
8    cccccccc llllllll            1-256   128:1 byte           2 pixels
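To make one row of the table concrete, here is a sketch of an encoder and decoder for the single-byte 4bpp `ccccllll` layout. It assumes the color sits in the high nibble and the low nibble stores length minus 1 (so 0 to 15 means 1 to 16 pixels); both are guesses at details the table leaves open.

```python
def encode_rle4(pixels):
    """Encode 4bpp pixel values (0..15) as ccccllll bytes, llll = run length - 1."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        color = pixels[i]
        if not 0 <= color <= 15:
            raise ValueError("4bpp pixel values must be 0..15")
        run = 1
        # Extend the run while the color repeats, up to the 16-pixel limit.
        while run < 16 and i + run < len(pixels) and pixels[i + run] == color:
            run += 1
        out.append((color << 4) | (run - 1))
        i += run
    return bytes(out)

def decode_rle4(data):
    """Expand ccccllll bytes back into a list of 4bpp pixel values."""
    pixels = []
    for b in data:
        pixels.extend([b >> 4] * ((b & 0x0F) + 1))
    return pixels
```

A run longer than 16 pixels simply spills into a second byte of the same color.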

However, it would be useful to have spans of literal pixels as well, instead of just solid color span fills.

0lllllll cccccccc = span length L of color C, 0 = transparent

1lllllll cccccccc... = L count of individual pixels
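The two record types can be decoded with a short loop. This sketch assumes 8bpp pixels and takes the 7-bit length field at face value (0 to 127, with 0 producing nothing); whether the hardware would bias the length is not decided above, and the function name is hypothetical.

```python
def decode_rle8(data: bytes) -> list[int]:
    """Decode mixed span/literal RLE records into a flat list of 8bpp pixels."""
    pixels = []
    i = 0
    while i < len(data):
        header = data[i]
        length = header & 0x7F
        i += 1
        if header & 0x80:
            # Literal record: `length` individual pixel bytes follow.
            if i + length > len(data):
                raise ValueError("truncated literal run")
            pixels.extend(data[i:i + length])
            i += length
        else:
            # Span record: one color byte repeated `length` times (0 = transparent).
            if i >= len(data):
                raise ValueError("truncated span")
            pixels.extend([data[i]] * length)
            i += 1
    return pixels
```

For example, the bytes `03 07 82 01 02 04 00` expand to three pixels of color 7, the literals 1 and 2, then four transparent pixels.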

For a bpp less than 8, probably require literal runs to fill a whole number of bytes or words.

For now, RLE layers should be simple length + 8bpp aligned words. RLE bitmaps would be something different, maybe it's too flexible so we should just leave that to the CPU. It would save a lot of bandwidth for bitmap overlay layers with large transparent windows, though.

DMA/Blitter

Maybe separate out 2d mode into its own blitter?

Flag to mask out the 'fill/mask' color (default 0)

Clip to output screen dimensions.

Xflip, yflip, maybe 90° rotation, but that means dest dimensions change? Scaling? Full affine transform?
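The point about dest dimensions can be seen in a tiny model that works on a list of pixel rows. The function name, argument names, and the choice of clockwise rotation are assumptions for illustration.

```python
def transform(src, xflip=False, yflip=False, rot90=False):
    """Return a transformed copy of `src`, given as a list of pixel rows."""
    rows = [list(r) for r in src]
    if xflip:
        rows = [r[::-1] for r in rows]       # mirror each row left/right
    if yflip:
        rows = rows[::-1]                    # reverse the row order
    if rot90:
        # Clockwise quarter turn: dest width = src height, dest height = src width.
        rows = [list(col) for col in zip(*rows[::-1])]
    return rows
```

A 3x2 source comes out as 2x3 after `rot90`, so the blitter would need to swap the dest w/h (or require square sources) whenever that flag is set.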

Unpack RLE graphics, for better memory usage. Could still do x/y flip because this isn't raster-dependent. Must know the total x/y size though if clipping is supported.

Fields:

  • bpp (could expand from src to dest given an offset?)
  • src w/h/stride
  • dest w/h/stride

TODO - V2

Clipping? Or should the src/dest be handled in software?

Ideally, there'd be a clip bounds defined at the dest address, w, h, stride, bpp. The source address is defined, and it's blitted into an x/y in the dest screen, automatically clipped. This could also be used as a pixel/stamp plotter.
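A sketch of the automatically clipped blit, assuming 8bpp (one byte per pixel) and flat buffers addressed by stride. The signature is hypothetical; the optional `mask_color` models the fill/mask flag mentioned earlier.

```python
def clipped_blit(dest, dest_w, dest_h, dest_stride,
                 src, src_w, src_h, src_stride,
                 x, y, mask_color=None):
    """Copy src into dest at (x, y), clipped to the dest bounds.

    Returns the number of pixels that fell inside the clip rectangle.
    """
    # Intersect the placed source rectangle with the dest rectangle.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + src_w, dest_w), min(y + src_h, dest_h)
    for dy in range(y0, y1):
        sy = dy - y
        for dx in range(x0, x1):
            pixel = src[sy * src_stride + (dx - x)]
            if pixel != mask_color:          # skip masked (transparent) pixels
                dest[dy * dest_stride + dx] = pixel
    return max(x1 - x0, 0) * max(y1 - y0, 0)
```

With a 1x1 source this degenerates into the pixel plotter mentioned above, and anything placed fully off-screen touches no memory.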

If there end up being a large number of parameters for a src or dest, it would be nice to have multiple profiles. Either read src/dest profiles from ram, or have say 4 src & 4 dests saved, and blit from src N to dest M.

RLE graphics should probably save their w/h and mode implicitly as the first 2 words, as they are their own free-form shapes.

Since DMA currently only takes place during VBLANK, could be more efficient to have a DMA list to run when VSYNC hits, blasting those out as fast as possible.

For 1bpp (or maybe others, too?), and/or/nor/nand/xor modes would be necessary.
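For reference, the listed logic modes reduce to per-byte bitwise operations when 1bpp data is packed 8 pixels to a byte. The table and function below are an illustrative sketch; the names are made up, and byte-aligned source and dest are assumed.

```python
# Each op combines a source byte with a dest byte (8 packed 1bpp pixels).
RASTER_OPS = {
    "and":  lambda s, d: s & d,
    "or":   lambda s, d: s | d,
    "xor":  lambda s, d: s ^ d,
    "nand": lambda s, d: ~(s & d) & 0xFF,
    "nor":  lambda s, d: ~(s | d) & 0xFF,
}

def blit_1bpp(dest: bytearray, src: bytes, op: str) -> None:
    """Combine src into dest in place, byte by byte, using the named op."""
    combine = RASTER_OPS[op]
    for i in range(min(len(src), len(dest))):
        dest[i] = combine(src[i], dest[i])
```

XOR is the classic choice for cursors and rubber-band lines, since applying the same blit twice restores the original pixels.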