Changes to add named fields and some other improvements #67

tombolano · 2024-04-18T19:29:01Z

tombolano
Apr 18, 2024
Collaborator

Hello, this is a great library, I found it some days ago because I wanted to use the Pandoc AST in Python and this library is exactly what I needed.

One thing that I like about Pandoc is the Lua filters interface, which allows accessing the fields of the elements by their names. Since this library lacked this feature I added it, but before that I also did a general review of the code and changed some other things. My changes are in https://github.com/tombolano/pandoc/tree/experimental. As a summary these are the most relevant changes:

Avoid creating temporary input and output files for communicating with pandoc, use stdin and stdout.
Fixed a bug that caused the configure and make_types functions to be called twice when loading the types module.
Modified the apply code to use a single function for the tree traversal. The typical post order recursive tree traversal function is like this (https://en.wikipedia.org/wiki/Tree_traversal#Post-order_implementation):
```
procedure postorder(node)
  if node = null
      return
  postorder(node.left)
  postorder(node.right)
  visit(node)
```
The current code does the post-order tree traversal correctly, but it is split across two functions, _apply_children and apply_ (inside apply). I found this a bit confusing, so I changed it to use a single function, called _apply_post_order, similar to the one from the example above.
Implemented concrete data types as dataclasses. This offers the following features:
- We can access the data fields by names, in a similar way as in Pandoc Lua filters, although note that the number and name of the fields may be different, i.e., sometimes the fields in the Lua filters API do not match directly the Pandoc Haskell types. For example a link element in the Haskell types is defined with the constructor Link !Attr ![Inline] !Target, but in the Pandoc Lua filters API it has fields content, target, title, and attr.
- The __init__, __repr__, and __eq__ methods are added automatically to the classes.
- The __match_args__ variable is created automatically.
- With python 3.10 or newer the pprint module can be used by default to pretty-print dataclasses, this way we can pretty-print the documents easily.
As a drawback, the implementation of the __getitem__ and __setitem__ methods now is a little more complicated since we cannot rely directly on a list.
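
The single-function post-order traversal described in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from the branch: the name apply_post_order and the choice to treat lists, tuples and dicts as the only containers are hypothetical.

```python
def apply_post_order(node, visit):
    """Visit every child of `node` first, then `node` itself (post-order)."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, (list, tuple)):
        children = list(node)
    else:
        children = []  # leaf: strings, numbers, None, ...
    for child in children:
        apply_post_order(child, visit)
    visit(node)  # the node is visited only after all of its children


visited = []
tree = ["a", ["b", "c"], {"k": "d"}]
apply_post_order(tree, visited.append)
# Leaves come before the containers that hold them; the root comes last.
print(visited)
```

Keeping the recursion and the visit in one function mirrors the pseudocode: children first, then the node.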

Note that the changes do not affect the API of the library, so any previous code should work as before. The only difference is that, because the data types are now implemented with dataclasses, the field names are also printed when printing a value. For example, consider this code from the examples in the documentation:

>>> import pandoc
>>> text = "Hello world!"
>>> doc = pandoc.read(text)
>>> doc
Pandoc(Meta({}), [Para([Str('Hello'), Space(), Str('world!')])])

Now with the changes the result is the following:

>>> import pandoc
>>> text = "Hello world!"
>>> doc = pandoc.read(text)
>>> doc
Pandoc(meta=Meta(table={}), blocks=[Para(content=[Str(text='Hello'), Space(), Str(text='world!')])])
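
To make the dataclass behaviour above concrete, here is a minimal, self-contained sketch. It does not use the library: the Str class below is a stand-in defined only for this example, and the __getitem__ and __setitem__ methods are one assumed way of keeping positional access on top of named fields.

```python
from dataclasses import dataclass, fields


@dataclass
class Str:
    text: str

    # Hypothetical helpers: map a position to the field with that index.
    def __getitem__(self, index):
        return getattr(self, fields(self)[index].name)

    def __setitem__(self, index, value):
        setattr(self, fields(self)[index].name, value)


s = Str("Hello")
print(s)             # the generated __repr__ includes the field name
print(s.text, s[0])  # named and positional access return the same value
s[0] = "world!"
print(s == Str(text="world!"))  # the generated __eq__ compares fields
```

Because the values live in named attributes rather than in a list, positional access has to go through the field order, which is the extra complexity mentioned above.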

@boisgera, if you are interested in some of these changes, you can take a look at my commits; I made sure to explain everything in detail in the commit messages. If you are interested I can submit pull requests with the changes, or you can pick and apply yourself any changes that you want.

boisgera · 2024-05-05T21:21:00Z

boisgera
May 5, 2024
Maintainer

Hi @tombolano,

Thanks for the feedback and sorry for the delay! I'm definitely very interested in everything you did, but unfortunately my free time has been scarce lately... Anyway, I certainly hope that we can integrate most of what you did. I'll try to have at least a quick look at your code next week.

Just one question at this point: I was also interested in constructor keyword arguments and named attributes at one stage (I do agree that it's nice to have!) and kinda lost interest when I saw that this information was absent from the Haskell type definitions (and I didn't want to settle for a manual naming system that would become obsolete with every new version of the Pandoc types). Do you derive the attribute names automatically? Sometimes there is a sensible name to give (for example, attr for the only attribute of type Attr), and sometimes it's pretty hard to get a reasonable name from the type (for example [([Inline], [[Block]])]). How did you solve this conundrum?

Cheers,

Sébastien


tombolano May 9, 2024
Collaborator Author

Hi @boisgera,

Thanks for your response. Yes, the field names are mostly derived automatically. This is done by the function get_data_fields in types.py, which receives a decl parameter with the type declaration and returns the list of field names. I say mostly because there are a few special cases where the function sets the field name depending on the type (e.g., for the type Header the name of the Int field is set to level). The algorithm is not very complicated, but it has to consider different cases:

First, the algorithm checks if the declaration defines a record type or a product type:

Record types: in Pandoc types 1.23.1 there are two of these types:
- Meta type:
  The type declaration in Python is: ['Meta', ['map', [['unMeta', ['map', ['Text', 'MetaValue']]]]]]
  
  Here there is a single field with name unMeta. The code has the special case that if the name found is unMeta then the field name is set to 'table' (because 'unMeta' is a very undescriptive name).
- Citation type:
  The type declaration in Python is: ['Citation', ['map', [['citationId', 'Text'], ['citationPrefix', ['list', ['Inline']]], ['citationSuffix', ['list', ['Inline']]], ['citationMode', 'CitationMode'], ['citationNoteNum', 'Int'], ['citationHash', 'Int']]]]
  
  Here the code takes the names citationId, citationPrefix, citationSuffix, citationMode, citationNoteNum, and citationHash, converts them to snake case, and removes the type prefix (citation), so for example citationNoteNum is converted to note_num.
Thus, the fields returned for these types are:
- Meta: ['table']
- Citation: ['id', 'prefix', 'suffix', 'mode', 'note_num', 'hash']
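
The record-type rule described above (snake case, then drop the type prefix, with unMeta as a special case) can be sketched like this. The helper names to_snake_case and record_field_name are hypothetical and the regular expression is an assumption; the actual get_data_fields function in the branch may be written differently.

```python
import re


def to_snake_case(name):
    """Convert camelCase or PascalCase to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def record_field_name(type_name, haskell_field):
    """Derive a Python field name from a Haskell record field name."""
    if haskell_field == "unMeta":  # special case described above
        return "table"
    snake = to_snake_case(haskell_field)
    prefix = to_snake_case(type_name) + "_"
    if snake.startswith(prefix):  # e.g. citation_note_num -> note_num
        snake = snake[len(prefix):]
    return snake


citation_fields = ["citationId", "citationPrefix", "citationSuffix",
                   "citationMode", "citationNoteNum", "citationHash"]
print([record_field_name("Citation", f) for f in citation_fields])
# ['id', 'prefix', 'suffix', 'mode', 'note_num', 'hash']
print(record_field_name("Meta", "unMeta"))
# table
```
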
Product types:

For each field of these types the code considers these cases:
- Fields defined with just a string:
  For example, the Code type in Haskell is declared as Code !Attr !Text
  In Python the declaration is ['Code', ['list', ['Attr', 'Text']]]
  
  In this case the code takes the string, converts it to snake case, and removes the type prefix if it exists. Thus, for the Code type the field names obtained are attr and text.
  
  A special case is added for the case of Int or Double fields, for example:
  - RowSpan: ['RowSpan', ['list', ['Int']]]
  - ColWidth: ['ColWidth_', ['list', ['Double']]]
  In these cases the field name is set to value.
  
  And an additional special case is added for the case of the Int field of the Header type:
  
  Header: ['Header', ['list', ['Int', 'Attr', ['list', ['Inline']]]]]
  
  In this case the field name is set to level.
- Maybe fields:
  The only Maybe field in Pandoc types 1.23.1 is in the Caption type,
  Its declaration in Haskell is Caption !(Maybe ShortCaption) ![Block]
  
  Its declaration in Python is ['Caption', ['list', [['maybe', ['ShortCaption']], ['list', ['Block']]]]]
  
  For this case the code checks that ['maybe', ['ShortCaption']] is a list with 'maybe' as the first element and that the second element is just a list with a single string, the code takes that string and converts it to snake case to obtain the field name. Thus, in this case the field name is taken as short_caption.
- List fields:
  One example of list fields is in the Cite type,
  Its declaration in Haskell is Cite ![Citation] ![Inline]
  
  Its declaration in Python is ['Cite', ['list', [['list', ['Citation']], ['list', ['Inline']]]]]
  
  In this case there are two fields: ['list', ['Citation']], and ['list', ['Inline']]. The code checks that 'list' is the first element and that the second element is just a list with a single string. In the general case the code takes that string, converts it to snake case, and converts it to plural with a simple algorithm by the function to_plural.
  
  Thus for the first field the name is set to citations. The second name, following this algorithm, should be inlines. However, note that the Pandoc Lua filters API does not use the names inlines or blocks except for the Pandoc type or the Meta types (e.g., MetaBlocks and MetaInlines); in all other cases it just names the field content. So, to keep the names consistent with the Pandoc Lua filters API, I added this check in the code:
```
type_name != "Pandoc"
and not type_name.startswith("Meta")
and (field == "Block" or field == "Inline")
```
  if the check passes then the field name is just set to content.
  
  Hence, for the Cite type the field names are set to ['citations', 'content'].
- Any other case:
  In any other case the field name is just set to content, for example the following types have a single field which will be named content:
  - DefinitionList type, declared in Haskell as DefinitionList ![([Inline], [[Block]])]
  - BulletList type, declared in Haskell as BulletList ![[Block]]
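
Putting the product-type cases above together, a simplified sketch could look like the following. It is written from the description only, so treat it as an approximation: product_field_name is a hypothetical name, and to_plural here is a deliberately naive stand-in for the function of the same name mentioned above.

```python
import re


def to_snake_case(name):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_plural(name):
    """Naive pluralisation, enough for names like 'citation' or 'row'."""
    return name + "s"


def product_field_name(type_name, field):
    if isinstance(field, str):  # field defined with just a string
        if type_name == "Header" and field == "Int":
            return "level"
        if field in ("Int", "Double"):
            return "value"
        name = to_snake_case(field)
        prefix = to_snake_case(type_name) + "_"
        return name[len(prefix):] if name.startswith(prefix) else name
    kind, args = field  # e.g. ['list', ['Inline']]
    if len(args) == 1 and isinstance(args[0], str):
        inner = args[0]
        if kind == "maybe":
            return to_snake_case(inner)
        if kind == "list":
            if (type_name != "Pandoc" and not type_name.startswith("Meta")
                    and inner in ("Block", "Inline")):
                return "content"
            return to_plural(to_snake_case(inner))
    return "content"  # any other case


def field_names(decl):
    type_name, (_, type_fields) = decl
    return [product_field_name(type_name, f) for f in type_fields]


print(field_names(["Header", ["list", ["Int", "Attr", ["list", ["Inline"]]]]]))
print(field_names(["Cite", ["list", [["list", ["Citation"]], ["list", ["Inline"]]]]]))
print(field_names(["Caption", ["list", [["maybe", ["ShortCaption"]], ["list", ["Block"]]]]]))
```

The duplicate-name step described next is intentionally left out of this sketch.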

The final step of the algorithm is to check whether there are duplicated field names. If there are, the algorithm renames them by appending a number ('1', '2', etc.) at the end. In Pandoc types 1.23.1 this happens only for the TableBody type, which is defined in Haskell as:

data TableBody = TableBody !Attr !RowHeadColumns ![Row] ![Row]

the resulting field names for this case are:

['attr', 'row_head_columns', 'rows1', 'rows2']
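
A small sketch of this renaming step, assuming the numbering starts at 1 so that it reproduces the result shown above (rename_duplicates is a hypothetical name, not the function in the branch):

```python
from collections import Counter


def rename_duplicates(names):
    """Append 1, 2, ... to every name that occurs more than once."""
    totals = Counter(names)
    seen = Counter()
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}{seen[name]}")
        else:
            result.append(name)  # unique names are left untouched
    return result


print(rename_duplicates(["attr", "row_head_columns", "rows", "rows"]))
# ['attr', 'row_head_columns', 'rows1', 'rows2']
```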

For Pandoc types 1.23.1, the complete list of field names generated for each type is the following:

AlignCenter         []
AlignDefault        []
AlignLeft           []
AlignRight          []
AuthorInText        []
BlockQuote          ['content']
BulletList          ['content']
Caption             ['short_caption', 'content']
Cell                ['attr', 'alignment', 'row_span', 'col_span', 'content']
Citation            ['id', 'prefix', 'suffix', 'mode', 'note_num', 'hash']
Cite                ['citations', 'content']
Code                ['attr', 'text']
CodeBlock           ['attr', 'text']
ColSpan             ['value']
ColWidthDefault     []
ColWidth_           ['value']
Decimal             []
DefaultDelim        []
DefaultStyle        []
DefinitionList      ['content']
DisplayMath         []
Div                 ['attr', 'content']
DoubleQuote         []
Emph                ['content']
Example             []
Figure              ['attr', 'caption', 'content']
Format              ['text']
Header              ['level', 'attr', 'content']
HorizontalRule      []
Image               ['attr', 'content', 'target']
InlineMath          []
LineBlock           ['content']
LineBreak           []
Link                ['attr', 'content', 'target']
LowerAlpha          []
LowerRoman          []
Math                ['type', 'text']
Meta                ['table']
MetaBlocks          ['blocks']
MetaBool            ['bool']
MetaInlines         ['inlines']
MetaList            ['meta_values']
MetaMap             ['content']
MetaString          ['text']
NormalCitation      []
Note                ['content']
OneParen            []
OrderedList         ['list_attributes', 'content']
Pandoc              ['meta', 'blocks']
Para                ['content']
Period              []
Plain               ['content']
Quoted              ['quote_type', 'content']
RawBlock            ['format', 'text']
RawInline           ['format', 'text']
Row                 ['attr', 'cells']
RowHeadColumns      ['value']
RowSpan             ['value']
SingleQuote         []
SmallCaps           ['content']
SoftBreak           []
Space               []
Span                ['attr', 'content']
Str                 ['text']
Strikeout           ['content']
Strong              ['content']
Subscript           ['content']
Superscript         ['content']
SuppressAuthor      []
Table               ['attr', 'caption', 'col_specs', 'head', 'bodies', 'foot']
TableBody           ['attr', 'row_head_columns', 'rows1', 'rows2']
TableFoot           ['attr', 'rows']
TableHead           ['attr', 'rows']
TwoParens           []
Underline           ['content']
UpperAlpha          []
UpperRoman          []

For obtaining this list of types and fields I just ran this code in my branch:

import pandoc

pandoc.configure(auto=True)

for type_str, type_val in sorted(pandoc.types._types_dict.items()):
    if issubclass(type_val, pandoc.types.Constructor):
        print(f"{type_str:<20}", type_val._fields, sep="")

I think the results are quite nice: in many cases they are the same as or very close to the Pandoc Lua API. However, there are some differences because in the Pandoc Lua API the fields are, as far as I know, named manually, which introduces some inconsistencies, and in a few cases the Pandoc Lua API adds more fields to the types (i.e., for these cases the structure of the Lua types is not the same as the Haskell types). Here is a comparison table with some examples:

Note: the reference for the Pandoc Lua API types is in https://pandoc.org/lua-filters.html#type-image
Note that I do not consider the tag and t fields because they are just the name of the type, nor the attributes, classes, and identifier fields, because they are just aliases for the contents of the attr field.

| Type | Fields (mine) | Fields (Pandoc Lua API) | Comments |
|---|---|---|---|
| Caption | `short_caption`, `content` | `short`, `long` | Mine has `short_caption` and `content`, Pandoc Lua API has `short` and `long` |
| Cell | `attr`, `alignment`, `row_span`, `col_span`, `content` | `attr`, `alignment`, `row_span`, `col_span`, `contents` | Mine has a `content` field, Pandoc Lua API has a `contents` field |
| Citation | `id`, `prefix`, `suffix`, `mode`, `note_num`, `hash` | `id`, `prefix`, `suffix`, `mode`, `note_num`, `hash` | The fields are the same |
| Cite | `citations`, `content` | `citations`, `content` | The fields are the same |
| Code | `attr`, `text` | `attr`, `text` | The fields are the same |
| CodeBlock | `attr`, `text` | `attr`, `text` | The fields are the same |
| Figure | `attr`, `caption`, `content` | `attr`, `caption`, `content` | The fields are the same |
| Image | `attr`, `content`, `target` | `attr`, `caption`, `src`, `title` | Here the Pandoc Lua API has more fields; it seems that Pandoc restructures the Image type for the Lua API |
| Link | `attr`, `content`, `target` | `attr`, `content`, `target`, `title` | The Pandoc Lua API has an additional `title` field |
| Math | `type`, `text` | `mathtype`, `text` | Mine has a `type` field, Pandoc Lua API has a `mathtype` field |
| Plain | `content` | `content` | The fields are the same |
| Quoted | `quote_type`, `content` | `quotetype`, `content` | Mine has a `quote_type` field, Pandoc Lua API has a `quotetype` field |
| RawBlock | `format`, `text` | `format`, `text` | The fields are the same |
| RawInline | `format`, `text` | `format`, `text` | The fields are the same |
| Span | `attr`, `content` | `attr`, `content` | The fields are the same |
| Str | `text` | `text` | The fields are the same |
| Strong | `content` | `content` | The fields are the same |
| Table | `attr`, `caption`, `col_specs`, `head`, `bodies`, `foot` | `attr`, `caption`, `colspecs`, `head`, `bodies`, `foot` | Mine has a `col_specs` field, Pandoc Lua API has a `colspecs` field |
| TableBody | `attr`, `row_head_columns`, `rows1`, `rows2` | `attr`, `row_head_columns`, `head`, `body` | Mine has `rows1` and `rows2` fields, Pandoc Lua API has `head` and `body` fields |

boisgera · 2024-05-12T19:07:45Z

boisgera
May 12, 2024
Maintainer

Well, this is a very well thought-out piece! Thanks for this very detailed answer, I see where you're coming from and definitely agree with your decision process (infer what can be inferred automatically, clean up the Lua filter naming inconsistencies, etc.)

Two details:

There is definitely a tension between the automation of the naming and the willingness to provide meaningful names. For Header, I would definitely agree that using level is worth it (even if it breaks the common rule); for the same reason, for TableBody I think I would have picked the descriptive names like the Lua filters (head and body instead of rows1 and rows2). What do you think?
wrt Meta. Agree 100% that unMeta sucks in our context. As far as I remember, it makes sense in Haskell because the language automatically generates accessors, so that unMeta is the function that unpacks a Meta object into its content. But I would suggest map instead of table (minor nitpick!) since Meta contains a Haskell map and map is already defined in pandoc.types (as a dict of course). Would that be ok with you?

I am definitely interested in reviewing this in detail. I can think of three related issues to consider at the moment:

Representation as Strings

I don't like very much the representation of pandoc items with keyword arguments by default. When

>>> import pandoc
>>> text = "Hello world!"
>>> doc = pandoc.read(text)

I think that I'd rather have

>>> doc
Pandoc(Meta({}), [Para([Str('Hello'), Space(), Str('world!')])])

instead of

>>> doc
Pandoc(meta=Meta(table={}), blocks=[Para(content=[Str(text='Hello'), Space(), Str(text='world!')])])

but this is mostly a matter of taste. I guess that a mechanism like NumPy's printoptions / set_printoptions could allow the user to switch between both representations? (Note that by the same mechanism we could introduce some pretty-printing that would probably alleviate my distaste for named fields.)

On a more practical note: the current library tests rely on the examples used in the documentation that use the compact/positional representation. If we were to change the default to named arguments, all tests would fail ATM.

Type Discoverability

I really, really like that I can forget the details of the pandoc type hierarchy and find this info in my Python console:

>>> from pandoc.types import *
>>> Meta
Meta({Text: MetaValue})
>>> Pandoc
Pandoc(Meta, [Block])
>>> AlignCenter
AlignCenter()
>>> Attr
Attr = (Text, [Text], [(Text, Text)])
>>> Cell
Cell(Attr, Alignment, RowSpan, ColSpan, [Block])

To use named arguments with the same degree of convenience (or greater), this type representation must be adapted. For example, something like:

>>> Pandoc
Pandoc(meta: Meta, blocks: [Block])

(should it also be configurable?)

Default Values

Named constructor arguments open the way for default values. For example, I'd much rather have

>>> doc = Pandoc(blocks=blocks)

than

>>> doc = Pandoc(Meta({}), blocks)

I did not think very deeply of it but I guess that at least some of the cases would be no-brainers (for example make every list and every map empty by default?).

Your inputs on this are welcome! 🤗


tombolano May 18, 2024
Collaborator Author

Many thanks for your response @boisgera, I respond to all your comments below.

There is definitely a tension between the automation of the naming and the willingness to provide meaningful names. For Header, I would definitely agree that using level is worth it (even if it breaks the common rule); for the same reason, for TableBody I think I would have picked the descriptive names like the Lua filters (head and body instead of rows1 and rows2). What do you think?

Yes, I agree. Actually, I noticed the TableBody case after writing the code. Since it is the only remaining special case, I also think it is good to add it.

wrt Meta. Agree 100% that unMeta sucks in our context. As far as I remember, it makes sense in Haskell because the language automatically generates accessors, so that unMeta is the function that unpacks a Meta object into its content. But I would suggest map instead of table (minor nitpick!) since Meta contains a Haskell map and map is already defined in pandoc.types (as a dict of course). Would that be ok with you?

Yes, map is a good name.

I have updated my branch of the code with the previous changes.

Representation as Strings

I don't like very much the representation of pandoc items with keyword arguments by default. When
>>> import pandoc
>>> text = "Hello world!"
>>> doc = pandoc.read(text)
I think that I'd rather have
>>> doc
Pandoc(Meta({}), [Para([Str('Hello'), Space(), Str('world!')])])
instead of
>>> doc
Pandoc(meta=Meta(table={}), blocks=[Para(content=[Str(text='Hello'), Space(), Str(text='world!')])])
but this is mostly a matter of taste. I guess that a mechanism like NumPy's printoptions / set_printoptions could allow the user to switch between both representations? (Note that by the same mechanism we could introduce some pretty-printing that would probably alleviate my distaste for named fields.)

On a more practical note: the current library tests rely on the examples used in the documentation that use the compact/positional representation. If we were to change the default to named arguments, all tests would fail ATM.

Thanks, I was aware that the tests were relying on the positional representation, so yes, I think that indeed a mechanism to switch between representations is needed.

For pretty-printing the best option may be to just use the rich library: https://github.com/Textualize/rich. By default this library already pretty-prints dataclasses, and if we want to print the dataclasses without the keywords it only requires adding a very simple __rich_repr__ method to the class (the documentation about this is here: https://rich.readthedocs.io/en/latest/pretty.html#rich-repr-protocol).

I have updated my branch of the code to implement this, the changes are actually quite simple:

When creating the types I do setattr(c, "_default_repr", c.__repr__) to save the __repr__ method to another attribute.
I added a print_options method to set the printing options, this method accepts these parameters:
- show_fields: a bool to set if we want to show the fields or not, the default is False
- types: the types to which we want to apply the options, or None (the default) to apply it to all.
If show_fields == False, __repr__ and __rich_repr__ are set to the adequate methods to not print the fields names. If show_fields == True, the __rich_repr__ method is deleted and the __repr__ method is set to the value of the _default_repr attribute (the original __repr__).

To not show the field names by default I also added a call to print_options in the configure function.
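
The switching mechanism described above can be illustrated with a small stand-alone sketch. It is not the code from the branch: Str and Para are stand-in dataclasses, the `types` default and the helper _positional_repr are assumptions made for the example, and the rich-specific __rich_repr__ part is omitted.

```python
from dataclasses import dataclass, fields


@dataclass
class Str:
    text: str


@dataclass
class Para:
    content: list


def _positional_repr(self):
    """repr without field names, e.g. Str('Hello')."""
    args = ", ".join(repr(getattr(self, f.name)) for f in fields(self))
    return f"{type(self).__name__}({args})"


def print_options(show_fields=False, types=(Str, Para)):
    for cls in types:
        # Save the dataclass-generated __repr__ once so it can be restored.
        if "_default_repr" not in cls.__dict__:
            cls._default_repr = cls.__repr__
        cls.__repr__ = cls._default_repr if show_fields else _positional_repr


doc = Para([Str("Hello")])
print_options(show_fields=False)
print(doc)  # Para([Str('Hello')])
print_options(show_fields=True)
print(doc)  # Para(content=[Str(text='Hello')])
```

Swapping __repr__ on the class affects existing instances too, so the same document prints differently before and after the call.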

Here is some code that shows it working:

import rich
import pprint
import pandoc

doc = pandoc.read("Lorem ipsum dolor sit amet, consectetur adipiscing elit")

for i in range(3):
    print("-"*40)
    if i == 0:
        print("Not printing fields (the default)")
    elif i == 1:
        pandoc.types.print_options(show_fields=True)
        print("Setting printing fields")
    elif i == 2:
        pandoc.types.print_options(show_fields=False)
        print("Setting not printing fields")
    print()

    print("print:")
    print(doc)
    print()

    print("pprint:")
    pprint.pp(doc)
    print()

    print("rich:")
    rich.print(doc)

The output is the following (note that in the terminal the output of the rich module is highlighted with colors, which are not shown here):

----------------------------------------
Not printing fields (the default)

print:
Pandoc(Meta({}), [Para([Str('Lorem'), Space(), Str('ipsum'), Space(), Str('dolor'), Space(), Str('sit'), Space(), Str('amet,'), Space(), Str('consectetur'), Space(), Str('adipiscing'), Space(), Str('elit')])])

pprint:
Pandoc(Meta({}), [Para([Str('Lorem'), Space(), Str('ipsum'), Space(), Str('dolor'), Space(), Str('sit'), Space(), Str('amet,'), Space(), Str('consectetur'), Space(), Str('adipiscing'), Space(), Str('elit')])])

rich:
Pandoc(
    Meta({}),
    [
        Para(
            [
                Str('Lorem'),
                Space(),
                Str('ipsum'),
                Space(),
                Str('dolor'),
                Space(),
                Str('sit'),
                Space(),
                Str('amet,'),
                Space(),
                Str('consectetur'),
                Space(),
                Str('adipiscing'),
                Space(),
                Str('elit')
            ]
        )
    ]
)
----------------------------------------
Setting printing fields

print:
Pandoc(meta=Meta(map={}), blocks=[Para(content=[Str(text='Lorem'), Space(), Str(text='ipsum'), Space(), Str(text='dolor'), Space(), Str(text='sit'), Space(), Str(text='amet,'), Space(), Str(text='consectetur'), Space(), Str(text='adipiscing'), Space(), Str(text='elit')])])

pprint:
Pandoc(meta=Meta(map={}),
       blocks=[Para(content=[Str(text='Lorem'),
                             Space(),
                             Str(text='ipsum'),
                             Space(),
                             Str(text='dolor'),
                             Space(),
                             Str(text='sit'),
                             Space(),
                             Str(text='amet,'),
                             Space(),
                             Str(text='consectetur'),
                             Space(),
                             Str(text='adipiscing'),
                             Space(),
                             Str(text='elit')])])

rich:
Pandoc(
    meta=Meta(map={}),
    blocks=[
        Para(
            content=[
                Str(text='Lorem'),
                Space(),
                Str(text='ipsum'),
                Space(),
                Str(text='dolor'),
                Space(),
                Str(text='sit'),
                Space(),
                Str(text='amet,'),
                Space(),
                Str(text='consectetur'),
                Space(),
                Str(text='adipiscing'),
                Space(),
                Str(text='elit')
            ]
        )
    ]
)
----------------------------------------
Setting not printing fields

print:
Pandoc(Meta({}), [Para([Str('Lorem'), Space(), Str('ipsum'), Space(), Str('dolor'), Space(), Str('sit'), Space(), Str('amet,'), Space(), Str('consectetur'), Space(), Str('adipiscing'), Space(), Str('elit')])])

pprint:
Pandoc(Meta({}), [Para([Str('Lorem'), Space(), Str('ipsum'), Space(), Str('dolor'), Space(), Str('sit'), Space(), Str('amet,'), Space(), Str('consectetur'), Space(), Str('adipiscing'), Space(), Str('elit')])])

rich:
Pandoc(
    Meta({}),
    [
        Para(
            [
                Str('Lorem'),
                Space(),
                Str('ipsum'),
                Space(),
                Str('dolor'),
                Space(),
                Str('sit'),
                Space(),
                Str('amet,'),
                Space(),
                Str('consectetur'),
                Space(),
                Str('adipiscing'),
                Space(),
                Str('elit')
            ]
        )
    ]
)

Type Discoverability

I really, really like that I can forget the details of the pandoc type hierarchy and find this info in my Python console:
>>> from pandoc.types import *
>>> Meta
Meta({Text: MetaValue})
>>> Pandoc
Pandoc(Meta, [Block])
>>> AlignCenter
AlignCenter()
>>> Attr
Attr = (Text, [Text], [(Text, Text)])
>>> Cell
Cell(Attr, Alignment, RowSpan, ColSpan, [Block])
To use named arguments with the same degree of convenience (or greater), this type representation must be adapted. For example, something like:
>>> Pandoc
Pandoc(meta: Meta, blocks: [Block])
(should it also be configurable?)

Yes, this is also important, I have checked and it is also quite easy to implement. I have already made the changes and uploaded them to my branch. The changes are these:

I updated the docstring function of pandoc.utils adding a constructors_fields parameter and updating the code adequately to include the field names for the constructors.

In print_options I added the parameter show_type_fields (by default True) and the following code:

if show_type_fields:
    t.__doc__ = pandoc.utils.docstring(t._def, t._fields)
else:
    t.__doc__ = pandoc.utils.docstring(t._def)

Thus, these fields are also configurable.
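
As a rough illustration of how a class can print as its own signature and how that signature can be switched, here is a stand-alone sketch. Everything in it is an assumption for the example: the metaclass MetaType, the helper signature, and the string-based field types are hypothetical and are not the pandoc.utils.docstring implementation.

```python
class MetaType(type):
    """Metaclass that shows a class's docstring as its repr."""

    def __repr__(cls):
        return cls.__doc__ or super().__repr__()


def signature(name, field_names, field_types, show_type_fields=True):
    """Build 'Name(field: Type, ...)' or 'Name(Type, ...)'."""
    if show_type_fields:
        args = [f"{n}: {t}" for n, t in zip(field_names, field_types)]
    else:
        args = list(field_types)
    return f"{name}({', '.join(args)})"


class Pandoc(metaclass=MetaType):
    pass


names, types_ = ["meta", "blocks"], ["Meta", "[Block]"]

Pandoc.__doc__ = signature("Pandoc", names, types_)
print(Pandoc)  # Pandoc(meta: Meta, blocks: [Block])

Pandoc.__doc__ = signature("Pandoc", names, types_, show_type_fields=False)
print(Pandoc)  # Pandoc(Meta, [Block])
```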

Here is some code that shows it working:

import pandoc.types

print(pandoc.types.Attr)
print(pandoc.types.Cell)
print(pandoc.types.Citation)
print(pandoc.types.Meta)
print(pandoc.types.Pandoc)

pandoc.types.print_options(show_type_fields=False)
print()

print(pandoc.types.Attr)
print(pandoc.types.Cell)
print(pandoc.types.Citation)
print(pandoc.types.Meta)
print(pandoc.types.Pandoc)

The output is the following:

Attr = (Text, [Text], [(Text, Text)])
Cell(attr: Attr, alignment: Alignment, row_span: RowSpan, col_span: ColSpan, content: [Block])
Citation(id: Text, prefix: [Inline], suffix: [Inline], mode: CitationMode, note_num: Int, hash: Int)
Meta(map: {Text: MetaValue})
Pandoc(meta: Meta, blocks: [Block])

Attr = (Text, [Text], [(Text, Text)])
Cell(Attr, Alignment, RowSpan, ColSpan, [Block])
Citation(Text, [Inline], [Inline], CitationMode, Int, Int)
Meta({Text: MetaValue})
Pandoc(Meta, [Block])

Default Values

Named constructor arguments open the way for default values. For example, I'd much rather have
>>> doc = Pandoc(blocks=blocks)
than
>>> doc = Pandoc(Meta({}), blocks)
I did not think very deeply of it but I guess that at least some of the cases would be no-brainers (for example make every list and every map empty by default?).

Yes, this would be really nice. The code in my branch was setting None as the default value for all fields:

>>> import pandoc.types
>>> pandoc.types.Pandoc()
Pandoc(None, None)

which is not that useful.

Note that to implement this feature all fields must have a default value, because in Python non-default arguments cannot follow default arguments. For example:

from dataclasses import dataclass

@dataclass
class my_class:
    a: str = 'a'
    b: str  # no default, but it follows a field that has one

fails with this error:

TypeError: non-default argument 'b' follows default argument

And this:

def my_func(a = 'a', b):
    pass

fails with this error:

SyntaxError: parameter without a default follows parameter with a default

I took a look at this and I managed to implement it in my branch. The implementation is not so difficult, this is what I did:

I updated the _get_data_field function so that instead of just returning a list of strings, it returns for each field the name and its type (using a new class Field), for example:

For Header the function returns:

[
    Field(name='level', type='Int'),
    Field(name='attr', type='Attr'),
    Field(name='content', type=['list', ['Inline']])
]

For Cell the function now returns:

[
    Field(name='attr', type='Attr'),
    Field(name='alignment', type='Alignment'),
    Field(name='row_span', type='RowSpan'),
    Field(name='col_span', type='ColSpan'),
    Field(name='content', type=['list', ['Block']])
]

And for OrderedList the function returns:

[
    Field(name='list_attributes', type='ListAttributes'),
    Field(name='content', type=['list', [['list', ['Block']]]])
]

I added a new function _get_default_value, which receives the parameter type_def with the type definition (this would be the type field of the Field objects as shown above) and returns the default value for the type. The function works as follows:
- If the type definition is a string, the class is retrieved from the string (this is just type = globals()[type_def]) and the following checks are made:
  1. The type is a subclass of TypeDef:
    
    In this case the definition of the type is in type._def[1][1]. The returned value is then _get_default_value(type._def[1][1]). An example of this case is Attr, which in Python is defined as:
```
['type', ['Attr', ['tuple', ['Text', ['list', ['Text']], ['list', [['tuple', ['Text', 'Text']]]]]]]]
```
    the returned value would be:
```
('', [], [])
```
  2. The type is a subclass of Constructor: The returned value is just an instance without specifying any parameter, i.e., type().
  3. The type is a subclass of Data:
    
    Here we have to return an instance of the data type. For example consider the case of Cell, which is defined in Haskell as:
```
data Cell = Cell !Attr !Alignment !RowSpan !ColSpan ![Block]
```
    The second parameter is an Alignment which is defined as:
```
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
```
    So to instantiate a Cell with default values we must select a default value for the Alignment. For this the function searches the subclasses of the type and if it finds one containing the string "default" it returns an instance of that class, else it returns an instance of the first one. So for the example of Cell the returned value would be:
```
Cell(('', [], []), AlignDefault(), RowSpan(0), ColSpan(0), [])
```
  4. Otherwise the type should be a built-in type, i.e., Bool, Int, Text, etc. These types can be called directly to obtain a default value, so in this case the returned value is just type().
- If the type definition is a list, then the first element is checked, the possible values are:
  - maybe: the function returns None
  - map: the function returns {}
  - list: the function returns []
  - tuple: the function returns tuple(_get_default_value(t) for t in type_def[1])
I updated _make_constructor_class to add default values to the types; for that I used the dataclasses.field function, setting the default_factory parameter to a lambda that calls _get_default_value.
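The rules above can be sketched in a self-contained way. This is a simplified illustration, not the code from the branch: `BUILTINS`, `ALIASES`, `default_value`, `pick_default_constructor` and the toy `Div` class are all assumptions standing in for the real type registry and generated classes.

```python
from dataclasses import dataclass, field

# Toy registries standing in for the module globals used by the real code.
BUILTINS = {"Bool": bool, "Int": int, "Double": float, "Text": str}
ALIASES = {
    "Attr": ["tuple", ["Text", ["list", ["Text"]],
                       ["list", [["tuple", ["Text", "Text"]]]]]],
}


def default_value(type_def):
    """Return a default value for a type definition (simplified sketch)."""
    if isinstance(type_def, str):
        if type_def in ALIASES:          # type alias: recurse on its definition
            return default_value(ALIASES[type_def])
        return BUILTINS[type_def]()      # built-in: call the type directly
    kind, args = type_def
    if kind == "maybe":
        return None
    if kind == "map":
        return {}
    if kind == "list":
        return []
    if kind == "tuple":
        return tuple(default_value(t) for t in args)
    raise ValueError(f"unknown type constructor: {kind!r}")


def pick_default_constructor(constructor_names):
    """Prefer a constructor whose name contains 'default', else the first one."""
    for name in constructor_names:
        if "default" in name.lower():
            return name
    return constructor_names[0]


@dataclass
class Div:
    # default_factory builds a fresh default for every new instance
    attr: tuple = field(default_factory=lambda: default_value("Attr"))
    content: list = field(default_factory=lambda: default_value(["list", ["Block"]]))


print(Div())
print(pick_default_constructor(["AlignLeft", "AlignRight", "AlignCenter", "AlignDefault"]))
```

Using `default_factory` rather than a plain default matters here because the defaults are mutable (lists, dicts): each instance gets its own fresh value instead of sharing one.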

With the previous changes now the example that you proposed works:

>>> import pandoc.types
>>> pandoc.types.Pandoc(blocks=[])
Pandoc(Meta({}), [])

The following is the full list of Constructor types with their default values:

AlignCenter()
AlignDefault()
AlignLeft()
AlignRight()
AuthorInText()
BlockQuote([])
BulletList([])
Caption(None, [])
Cell(('', [], []), AlignDefault(), RowSpan(0), ColSpan(0), [])
Citation('', [], [], AuthorInText(), 0, 0)
Cite([], [])
Code(('', [], []), '')
CodeBlock(('', [], []), '')
ColSpan(0)
ColWidthDefault()
ColWidth_(0.0)
Decimal()
DefaultDelim()
DefaultStyle()
DefinitionList([])
DisplayMath()
Div(('', [], []), [])
DoubleQuote()
Emph([])
Example()
Figure(('', [], []), Caption(None, []), [])
Format('')
Header(0, ('', [], []), [])
HorizontalRule()
Image(('', [], []), [], ('', ''))
InlineMath()
LineBlock([])
LineBreak()
Link(('', [], []), [], ('', ''))
LowerAlpha()
LowerRoman()
Math(DisplayMath(), '')
Meta({})
MetaBlocks([])
MetaBool(False)
MetaInlines([])
MetaList([])
MetaMap({})
MetaString('')
NormalCitation()
Note([])
OneParen()
OrderedList((0, DefaultStyle(), DefaultDelim()), [])
Pandoc(Meta({}), [])
Para([])
Period()
Plain([])
Quoted(SingleQuote(), [])
RawBlock(Format(''), '')
RawInline(Format(''), '')
Row(('', [], []), [])
RowHeadColumns(0)
RowSpan(0)
SingleQuote()
SmallCaps([])
SoftBreak()
Space()
Span(('', [], []), [])
Str('')
Strikeout([])
Strong([])
Subscript([])
Superscript([])
SuppressAuthor()
Table(('', [], []), Caption(None, []), [], TableHead(('', [], []), []), [], TableFoot(('', [], []), []))
TableBody(('', [], []), RowHeadColumns(0), [], [])
TableFoot(('', [], []), [])
TableHead(('', [], []), [])
TwoParens()
Underline([])
UpperAlpha()
UpperRoman()

This was the code used to obtain the previous output:

import pandoc.types

for type_str, type_val in sorted(pandoc.types._types_dict.items()):
    if issubclass(type_val, pandoc.types.Constructor):
        print(type_val())

boisgera · 2024-05-25T15:06:00Z

boisgera
May 25, 2024
Maintainer

Very nice. I have not unpacked everything you've done yet (for example I need to read more about rich), but I really like it so far!

At this stage, I think that it would be simpler to add you as a collaborator to my repo, so that you can create an experimental branch with your contributions and make changes with minimal friction. Would that be ok with you?

7 replies

tombolano Jun 10, 2024
Collaborator Author

Thanks @boisgera, I have accepted the invitation. I will create an experimental branch (is this name ok or do you prefer another one?) and upload my contributions there.

boisgera Jun 10, 2024
Maintainer

I think that the name experimental is fine ; you could also go for something more specific, such as named-fields (even if your work is not strictly limited to this topic). Your choice!

boisgera Jul 26, 2024
Maintainer

Hi @tombolano,

Are you still willing to contribute your work as a branch? I'm starting to have a bit of time to work on the project again, and as a first move I'd like to get to the point where your work can be contributed as a pure extension (no breaking changes, just some extra features).

Cheers,

Sébastien

tombolano Jul 26, 2024
Collaborator Author

Hi @boisgera,

Yes, I'm still willing to contribute my work as a branch. I'm sorry for the long delay; I've been very busy with work over the last few weeks. This week I had time and I checked what work still needed to be done. Apart from some minor fixes, I think the main thing left is documentation. I've done some local work but haven't committed it yet to a new branch. I'll make the commit tomorrow morning.

Best regards,
Tomás

boisgera Jul 26, 2024
Maintainer

OK great!

I totally get the "bandwidth" issue, no worries.

tombolano · 2024-07-27T11:36:42Z

tombolano
Jul 27, 2024
Collaborator Author

Hello @boisgera,

I have uploaded my work to the new branch 'experimental'.

Apart from the work already discussed before, this has the following changes:

I added changes to show the default values when printing the types, so for example, for the Pandoc type:
- With the code in the master branch:
```
>>> Pandoc
Pandoc(Meta, [Block])
```
- The default behaviour previously in my branch, as discussed in previous messages, was to show also the field names:
```
>>> Pandoc
Pandoc(meta: Meta, blocks: [Block])
```
- Now in the new branch 'experimental' the default values are also shown:
```
>>> Pandoc
Pandoc(meta: Meta = Meta({}), blocks: [Block] = [])
```
  Of course this is configurable and we can enable or disable showing the field names and the default values.
I removed the print_options function and replaced it with two functions:
- set_types_docstring to set the types docstring, it has two parameters:
  - show_fields to set if we want to show the field names.
  - show_default_values to set if we want to show the field default values.
- set_data_repr to set if we want to display field names in the data repr.
I added to the make_types function the parameters data_show_fields, types_show_fields, and types_show_default_values to configure the display of the types when creating them.
I moved the docstring function from utils.py to types.py, this is to avoid a circular import because with the changes made the function now uses the _get_data_fields and _get_default_value from types.py. We could import the types module inside the docstring function, but I think it is better to just move the docstring function to the types module, since it does not have any other dependencies.
I made some changes to the documentation:
- I added a small explanation of how to access the document by field names in document.md
- I updated the code outputs to show the field names and default values.
- In the output of the pandoc.configure() examples I added ellipsis to the 'version' and the 'pandoc_types_version' fields so the tests do not depend on the pandoc version used.
- I updated the cookbook.md examples to use field names instead of indices.
- I updated the generate_types_documentation.py file to use paths relative to the file and not the current working directory.
- I made sure that all the tests pass.

However note that there are some things still not documented:

The creation of documents using fields and considering the default values of the fields.
Maybe adding set_types_docstring and set_data_repr to the API documentation.
Pretty printing the elements with the rich library.

The only breaking thing with respect to the master version are the docstrings of the types, which as I said now print the field names and default values, and I don't know if this could even be considered "breaking". However this can be changed to be exactly as in master by running set_types_docstring(show_fields=False, show_default_values=False).
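To make the three display modes discussed above concrete, here is a minimal sketch of a signature formatter driven by two flags. The function `format_signature` and the triple-based field description are hypothetical and only mirror the `show_fields` and `show_default_values` options; the real docstring code in the branch is organised differently.

```python
def format_signature(name, fields, show_fields=True, show_default_values=True):
    """Build a type signature string.

    fields is a list of (field_name, type_text, default_repr) triples.
    """
    parts = []
    for field_name, type_text, default_repr in fields:
        part = f"{field_name}: {type_text}" if show_fields else type_text
        if show_default_values:
            part += f" = {default_repr}"
        parts.append(part)
    return f"{name}({', '.join(parts)})"


pandoc_fields = [("meta", "Meta", "Meta({})"), ("blocks", "[Block]", "[]")]

# master-like output, field names only, and field names with defaults
print(format_signature("Pandoc", pandoc_fields, show_fields=False, show_default_values=False))
print(format_signature("Pandoc", pandoc_fields, show_default_values=False))
print(format_signature("Pandoc", pandoc_fields))
```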

Best regards,
Tomás

6 replies

boisgera Aug 27, 2024
Maintainer

Hello @tombolano & everyone,

I'd like to discuss again the names selected for the constructor arguments in the experimental branch of the Pandoc Python Library. I'd like these names to be derived as simply as possible from our short type definitions (or their original definitions in Text.Pandoc.Definition since they are equivalent), since it's IMHO easier for the library users, easier for the library maintainers and more robust wrt some potential changes in the pandoc types hierarchy.

I consider that many of your derivation tricks are smart and should feel "obvious" to the Python programmer and therefore consistent with the above principle:

The snakecase-ification of type names: a ShortCaption is named short_caption,
The use of plural: a [Cell] argument is named cells,
The simplification of attribute names in records: a citationSuffix in a Citation is simply a suffix,
etc.
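The three derivation rules listed above can be sketched as follows. The helper names and their implementations are assumptions for illustration (in particular the naive plural rule); the library's own `_to_snake_case` and `_to_plural` may behave differently.

```python
import re


def to_snake_case(name: str) -> str:
    """'ShortCaption' -> 'short_caption', 'citationSuffix' -> 'citation_suffix'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_plural(name: str) -> str:
    """Naive plural, sufficient for names such as 'cell' or 'block'."""
    return name + "s"


def strip_type_prefix(record_field: str, type_name: str) -> str:
    """Drop the owning type's name from a record field name."""
    return to_snake_case(record_field).removeprefix(type_name.lower()).lstrip("_")


print(to_snake_case("ShortCaption"))
print(to_plural(to_snake_case("Cell")))
print(strip_type_prefix("citationSuffix", "Citation"))
```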

But here are some examples of what I perceive as deviations:

Int and Double are value (instead of int and double).
The first argument of a header (type: Header Int Attr [Inline]) is not an int but a level. I agree that level is a more precise semantic name (I think that I even asked for this change originally!) and is aligned with the documentation in Text.Pandoc.Definition: "Header - level (integer) and text (inlines)". But this information is unfortunately not present in the original types themselves.
content is used for many different things: mainly instead of inlines and blocks, but sometimes for more complex types. And it's not systematic either: a Pandoc Meta [Block] has a blocks argument, but a Div Attr [Block] has a content argument instead of blocks. Despite the (arguably) more precise semantic match of content, I feel that it's less precise wrt the types involved and also a significant deviation from the original types definition.

Now, here is the silver lining: some of these issues could be resolved if pandoc (Haskell) could introduce some additional type aliases. For example, today pandoc defines the Caption type as

type ShortCaption = [Inline]
data Caption = Caption (Maybe ShortCaption) [Block]

instead of

data Caption = Caption (Maybe [Inline]) [Block]

I think that this approach is better for pandoc and a huge facilitator for our project, since there is more information embedded into the original type definitions.

Therefore I'd like to propose the following tentative plan (feedback welcome!):

for the moment, we try hard to stick with whatever information is in the original type definition (no level, no value, no content, etc.) and "no special cases" for semantic reasons (only for disambiguation if needed). We keep it simple and stupid.
we discuss this issue with the pandoc project to see if they are open to introduce some extra type aliases in the next versions of pandoc-types for semantic reasons. AFAICT these changes would not require any change in the pandoc library itself or in any code of its users. Let me put @jgm in the loop here to initiate this!
If such changes were adopted, we could reintroduce a more meaningful Python API.

Let me throw here some suggestions of additional type aliases for pandoc:

`Header`

type Level = Int

data Block 
  = ...
  | Header Level Attr [Inline]
  | ...

`DefinitionList`

The current definition of DefinitionList is

data Block
   = ...
   | DefinitionList [([Inline], [[Block]])]
   | ...

which could be decomposed into:

type Term = [Inline]
type Definitions = [[Block]]
type ListItem = (Term, Definitions)

data Block
    = ...
    | DefinitionList [ListItem]
    | ...

The two following proposals are more controversial and intended to avoid raw [[Inline]] and [[Block]] types in the definitions, since they are hard to name. But alternatively, we could keep them in pandoc and decide at the pandoc Python library level that a [[T]] is named list_of_ts (?).

`LineBlock`

The only instance of [[Inline]] in pandoc today. Given the documentation, it would make sense to introduce:

type Line = [Inline]

data Block 
    = ...
    | LineBlock [Line]
    | ...

Lists of Blocks

(Controversial ?)

type ListOfBlocks = [[Block]] in order to replace the occurrence of [[Block]] everywhere (?).

I don't like this terminology very much, and I am not sure that would be an improvement for pandoc.

tombolano Aug 30, 2024
Collaborator Author

Hello @boisgera,

I answer below

But here are some examples of what I perceive as deviations:

Int and Double are value (instead of int and double).

The first argument of a header (type: Header Int Attr [Inline]) is not an int but a level. I agree that level is a more precise semantic name (I think that I even asked for this change originally!) and is aligned with the documentation in Text.Pandoc.Definition: "Header - level (integer) and text (inlines)". But this information is unfortunately not present in the original types themselves.

content is used for many different things: mainly instead of inlines and blocks, but sometimes for more complex types. And it's not systematic either: a Pandoc Meta [Block] has a blocks argument, but a Div Attr [Block] has a content argument instead of blocks. Despite the (arguably) more precise semantic match of content, I feel that it's less precise wrt the types involved and also a significant deviation from the original types definition.

Yes, technically it would be best if this semantic information were present in the type definitions. I added those deviations because my idea was to have an API similar to the Pandoc Lua API (https://pandoc.org/lua-filters.html#lua-type-reference); for example, the naming of the inlines and blocks is consistent with the names in the Pandoc Lua API, which, as you say, is not systematic. I think that the types and fields for the Pandoc Lua API are defined separately from the Pandoc types, in the pandoc-lua-engine package (https://github.com/jgm/pandoc/tree/main/pandoc-lua-engine). However, I am not sure how it is done, since I don't know Haskell and cannot follow the code well.

Now, here is the silver lining: some of these issues could be resolved if pandoc (Haskell) could introduce some additional type aliases. For example, today pandoc defines the Caption type as
type ShortCaption = [Inline]
data Caption = Caption (Maybe ShortCaption) [Block]
instead of
data Caption = Caption (Maybe [Inline]) [Block]
I think that this approach is better for pandoc and a huge facilitator for our project, since there is more information embedded into the original type definitions.

Therefore I'd like to propose the following tentative plan (feedback welcome!):

for the moment, we try hard to stick with whatever information is in the original type definition (no level, no value, no content, etc.) and "no special cases" for semantic reasons (only for disambiguation if needed). We keep it simple and stupid.

we discuss this issue with the pandoc project to see if they are open to introduce some extra type aliases in the next versions of pandoc-types for semantic reasons. AFAICT these changes would not require any change in the pandoc library itself or in any code of its users. Let me put @jgm in the loop here to initiate this!

If such changes were adopted, we could reintroduce a more meaningful Python API.

Yes, I agree, I think that from a technical point of view it is best to stick only with the information in the type definitions.

For reference I made the following table of the field names that we generate now in the experimental branch, and the field names that we would obtain if we did not consider any special cases (I omitted the types without fields from the table). Note that some types in the table still have a content field when the field name cannot be derived; the types affected by this are:

BulletList ![[Block]]
DefinitionList ![([Inline], [[Block]])]
LineBlock ![[Inline]]
MetaMap !(Map Text MetaValue)
OrderedList !ListAttributes ![[Block]]

Honestly, I think that the field names without special cases are also good in most cases, for example for me Header is OK with an int field instead of level. The only case that I find weird is the un_meta field of the Meta type.

| Type | Field names (experimental branch) | Field names without special cases |
|---|---|---|
| BlockQuote | `content` | `blocks` |
| BulletList | `content` | `content` |
| Caption | `short_caption`, `content` | `short_caption`, `blocks` |
| Cell | `attr`, `alignment`, `row_span`, `col_span`, `content` | `attr`, `alignment`, `row_span`, `col_span`, `blocks` |
| Citation | `id`, `prefix`, `suffix`, `mode`, `note_num`, `hash` | `id`, `prefix`, `suffix`, `mode`, `note_num`, `hash` |
| Cite | `citations`, `content` | `citations`, `inlines` |
| Code | `attr`, `text` | `attr`, `text` |
| CodeBlock | `attr`, `text` | `attr`, `text` |
| ColSpan | `value` | `int` |
| ColWidth_ | `value` | `double` |
| DefinitionList | `content` | `content` |
| Div | `attr`, `content` | `attr`, `blocks` |
| Emph | `content` | `inlines` |
| Figure | `attr`, `caption`, `content` | `attr`, `caption`, `blocks` |
| Format | `text` | `text` |
| Header | `level`, `attr`, `content` | `int`, `attr`, `inlines` |
| Image | `attr`, `content`, `target` | `attr`, `inlines`, `target` |
| LineBlock | `content` | `content` |
| Link | `attr`, `content`, `target` | `attr`, `inlines`, `target` |
| Math | `type`, `text` | `type`, `text` |
| Meta | `map` | `un_meta` |
| MetaBlocks | `blocks` | `blocks` |
| MetaBool | `bool` | `bool` |
| MetaInlines | `inlines` | `inlines` |
| MetaList | `meta_values` | `meta_values` |
| MetaMap | `content` | `content` |
| MetaString | `text` | `text` |
| Note | `content` | `blocks` |
| OrderedList | `list_attributes`, `content` | `list_attributes`, `content` |
| Pandoc | `meta`, `blocks` | `meta`, `blocks` |
| Para | `content` | `inlines` |
| Plain | `content` | `inlines` |
| Quoted | `quote_type`, `content` | `quote_type`, `inlines` |
| RawBlock | `format`, `text` | `format`, `text` |
| RawInline | `format`, `text` | `format`, `text` |
| Row | `attr`, `cells` | `attr`, `cells` |
| RowHeadColumns | `value` | `int` |
| RowSpan | `value` | `int` |
| SmallCaps | `content` | `inlines` |
| Span | `attr`, `content` | `attr`, `inlines` |
| Str | `text` | `text` |
| Strikeout | `content` | `inlines` |
| Strong | `content` | `inlines` |
| Subscript | `content` | `inlines` |
| Superscript | `content` | `inlines` |
| Table | `attr`, `caption`, `col_specs`, `head`, `bodies`, `foot` | `attr`, `caption`, `col_specs`, `head`, `bodies`, `foot` |
| TableBody | `attr`, `row_head_columns`, `head`, `body` | `attr`, `row_head_columns`, `rows1`, `rows2` |
| TableFoot | `attr`, `rows` | `attr`, `rows` |
| TableHead | `attr`, `rows` | `attr`, `rows` |
| Underline | `content` | `inlines` |

Let me throw here some suggestions of additional type aliases for pandoc:

`Header`

type Level = Int

data Block 
  = ...
  | Header Level Attr [Inline]
  | ...

`DefinitionList`

The current definition of DefinitionList is

data Block
   = ...
   | DefinitionList [([Inline], [[Block]])]
   | ...

which could be decomposed into:

type Term = [Inline]
type Definitions = [[Block]]
type ListItem = (Term, Definitions)

data Block
    = ...
    | DefinitionList [ListItem]
    | ...

Yes, these extra type aliases would be great. I was also thinking of

data TableBody = TableBody !Attr !RowHeadColumns ![Row] ![Row]

This has two [Row] values, but the definition does not say what each one represents. A suggestion to make it clearer:

type TableBodyHead = [Row]
data TableBody = TableBody !Attr !RowHeadColumns !TableBodyHead ![Row]

The two following proposals are more controversial and intended to avoid raw [[Inline]] and [[Block]] types in the definitions, since they are hard to name. But alternatively, we could keep them in pandoc and decide at the pandoc Python library level that a [[T]] is named list_of_ts (?).

LineBlock

The only instance of [[Inline]] in pandoc today. Given the documentation, it would make sense to introduce:
type Line = [Inline]

data Block 
    = ...
    | LineBlock [Line]
    | ...
Lists of Blocks

(Controversial ?)

type ListOfBlocks = [[Block]] in order to replace the occurrence of [[Block]] everywhere (?).

I don't like this terminology very much, and I am not sure that would be an improvement for pandoc.

I think that type Line = [Inline] is fine because it adds semantic information; with this we could name the field of LineBlock lines. On the other hand, I do not like type ListOfBlocks = [[Block]] for all the occurrences of [[Block]], since it does not add any new semantic information; as you say, we can handle it at the Python library level.

boisgera Sep 2, 2024
Maintainer

Hi Tomás!

Yes, I agree, I think that from a technical point of view it is best to stick only with the information in the type definitions.

👍

[...] The only case that I find weird is the un_meta field of the Meta type.

100% agree. Here I think that the issue is that there is actually an underlying Haskell pattern (the use of newtype wrapper for type safety and the corresponding "deconstructor" (unwrapper), prefixed with a "un") but I don't see a Pythonic pattern that would match this and therefore it feels weird. I think that I'll document what I understand of the Haskell context in a separate discussion (or issue) to begin with.

I think that type Line = [Inline] is fine because it adds semantic information; with this we could name the field of LineBlock lines. On the other hand, I do not like type ListOfBlocks = [[Block]] for all the occurrences of [[Block]], since it does not add any new semantic information; as you say, we can handle it at the Python library level.

Agreed on the principle. But do you have a good systematic idea to name a variable of type [[T]] that would be applicable to [[Block]] (and maybe [[Inline]])? Or do we need to fall back to the generic content name in this case? So far what I am thinking of is either ambiguous/misleading (blocks_collections, many_blocks, list_of_blocks, etc.) or unambiguous but too long to be convenient (e.g. list_of_list_of_blocks). I even used ChatGPT to brainstorm a bit; the results were not very good either IMHO (block_grid, block_matrix, etc.)!

And finally, agreed that the arguments of TableBody could be more explicit!

boisgera Sep 3, 2024
Maintainer

[...] The only case that I find weird is the un_meta field of the Meta type.

100% agree. Here I think that the issue is that there is actually an underlying Haskell pattern (the use of newtype wrapper for type safety and the corresponding "deconstructor" (unwrapper), prefixed with a "un") but I don't see a Pythonic pattern that would match this and therefore it feels weird. I think that I'll document what I understand of the Haskell context in a separate discussion (or issue) to begin with.

I've documented here where the unMeta name comes from and why it should imho be disposed of during the Haskell to Python conversion. My conclusion/suggestion at this stage would be to use the name that we canonically associate to fields of type Map Text MetaValue (we have no such association today ; see MetaMap that uses the fallback name content).

Can we come up with something that would work for every Map? Some (mediocre) ideas that need some refinement:

map would allow us to keep the current name and is simple, but then why don't we name [Inline] just list instead of inlines? Why be precise for lists and ambiguous for maps? I feel that inlines is great and I'd like a similar level of specificity for maps.
meta_values or metavalue (depends if we emphasize the collection aspect or individual indexing)? And we keep the index type implicit (or implicit only because it's text, the most common use case?). If we use the plural, we have some ambiguity since inlines could refer to [Inline] or (for example) Map Text Inline. Is it bad though?
text_to_meta_value, meta_value_from_text, etc? Non-ambiguous, very mechanical translation, but long identifiers (Is it an issue? We can still use the [0] indexing if we feel that the generated field names are too cumbersome).

tombolano Sep 8, 2024
Collaborator Author

Hello @boisgera

Agreed on the principle. But do you have a good systematic idea to name a variable of type [[T]] that would be applicable to [[Block]] (and maybe [[Inline]])? Or do we need to fall back to the generic content name in this case? So far what I am thinking of is either ambiguous/misleading (blocks_collections, many_blocks, list_of_blocks, etc.) or unambiguous but too long to be convenient (e.g. list_of_list_of_blocks). I even used ChatGPT to brainstorm a bit; the results were not very good either IMHO (block_grid, block_matrix, etc.)!

I was thinking of list_of_blocks, but mainly because of your previous message, where you made this suggestion:

type ListOfBlocks = [[Block]] in order to replace the occurrence of [[Block]] everywhere (?).

I think that the best options are list_of_blocks or content; given the information in the types, I think this is the best that we can do.

I've documented here where the unMeta name comes from and why it should imho be disposed of during the Haskell to Python conversion. My conclusion/suggestion at this stage would be to use the name that we canonically associate to fields of type Map Text MetaValue (we have no such association today ; see MetaMap that uses the fallback name content).

Thanks a lot for explaining the unMeta thing! I now understand how these Haskell constructs work.

Can we come up with something that would work for every Map? Some (mediocre) ideas that need some refinement:

map would allow us to keep the current name and is simple, but then why don't we name [Inline] just list instead of inlines? Why be precise for lists and ambiguous for maps? I feel that inlines is great and I'd like a similar level of specificity for maps.

meta_values or metavalue (depends if we emphasize the collection aspect or individual indexing)? And we keep the index type implicit (or implicit only because it's text, the most common use case?). If we use the plural, we have some ambiguity since inlines could refer to [Inline] or (for example) Map Text Inline. Is it bad though?

text_to_meta_value, meta_value_from_text, etc? Non-ambiguous, very mechanical translation, but long identifiers (Is it an issue? We can still use the [0] indexing if we feel that the generated field names are too cumbersome).

I personally like both the map and the text_to_meta_value options. I managed to obtain the text_to_meta_value and list_of_blocks naming schemes with a few changes to the code; this is the diff of the changes:

diff --git a/src/pandoc/types/__init__.py b/src/pandoc/types/__init__.py
index 6806d63..958035f 100644
--- a/src/pandoc/types/__init__.py
+++ b/src/pandoc/types/__init__.py
@@ -68,15 +68,22 @@ def _rename_duplicate_fields(fields: list[str]):
     return ret
 
 
-def _field_type_name(field_type, field):
+def _field_type_name(field_type, field, recursive=False, prefix=""):
     if (
         isinstance(field, list)
         and field[0] == field_type
         and isinstance(field[1], list)
         and len(field[1]) == 1
-        and isinstance(field[1][0], str)
     ):
-        return field[1][0]
+        if isinstance(field[1][0], str):
+            return _to_snake_case(field[1][0]).lower()
+        elif isinstance(field[1][0], list) and recursive:
+            if name := _field_type_name(field_type, field[1][0], recursive, prefix):
+                return prefix + name
+            else:
+                return None
+        else:
+            return None
     else:
         return None
 
@@ -99,6 +106,27 @@ class Field:
     type: str | list
 
 
+def _get_field_name(type_decl, type_name=None) -> str:
+    if isinstance(type_decl, str):
+        field = type_decl.removeprefix(type_name) if type_name else type_decl
+        field = _to_snake_case(field).lower()
+    elif type_decl[0] == "map":
+        key_field = _get_field_name(type_decl[1][0], type_name).lower()
+        value_field = _get_field_name(type_decl[1][1], type_name).lower()
+        field = f"{key_field}_to_{value_field}"
+    elif field := _field_type_name("maybe", type_decl):
+        if type_name:
+            field = field.removeprefix(type_name.lower()).lstrip("_")
+    elif field := _field_type_name("list", type_decl, True, "list_of_"):
+        if type_name:
+            field = field.removeprefix(type_name.lower()).lstrip("_")
+        field = _to_plural(field)
+    else:
+        field = "content"
+
+    return field
+
+
 def _get_data_fields(decl: list) -> list[Field]:
     """
     Return for each constructor declaration the name & type of each argument.
@@ -144,47 +172,23 @@ def _get_data_fields(decl: list) -> list[Field]:
 
     if type_def[0] == "map":
         for field, type in type_def[1]:
-            field = _to_snake_case(field).lower()
-            field = field.removeprefix(type_name.lower()).lstrip("_")
-            field_names.append(field)
-            field_types.append(type)
-    else:
-        for type in type_def[1]:
-            if isinstance(type, str):
-                if type == "Int" or type == "Double":
-                    field = "value"
-                else:
-                    field = _to_snake_case(type.removeprefix(type_name))
-            elif field := _field_type_name("maybe", type):
+            if field == "un" + type_name:
+                # detect the haskell pattern un<type name>.
+                # In this case we derive the field name from the type
+                field = _get_field_name(type, type_name)
+            else:
                 field = _to_snake_case(field).lower()
                 field = field.removeprefix(type_name.lower()).lstrip("_")
-            elif field := _field_type_name("list", type):
-                if (
-                    type_name != "Pandoc"
-                    and not type_name.startswith("Meta")
-                    and (field == "Block" or field == "Inline")
-                ):
-                    field = "content"
-                else:
-                    field = _to_snake_case(field).lower()
-                    field = field.removeprefix(type_name.lower()).lstrip("_")
-                    field = _to_plural(field)
-            else:
-                field = "content"
 
             field_names.append(field)
             field_types.append(type)
+    else:
+        for type in type_def[1]:
+            field_names.append(_get_field_name(type, type_name))
+            field_types.append(type)
 
     field_names = _rename_duplicate_fields(field_names)
 
-    # Handle special cases
-    if type_name == "Meta":
-        field_names = _rename_fields(field_names, {"un_meta": "map"})
-    elif type_name == "Header":
-        field_names = _rename_fields(field_names, {"value": "level"})
-    elif type_name == "TableBody":
-        field_names = _rename_fields(field_names, {"rows1": "head", "rows2": "body"})
-
     return [Field(n, t) for n, t in zip(field_names, field_types)]

In these changes:

I added a new _get_field_name function to return the name for each field of the types, this simplifies a bit the _get_data_fields function. Here is the code that returns the names for the maps:
```
elif type_decl[0] == "map":
   key_field = _get_field_name(type_decl[1][0], type_name).lower()
   value_field = _get_field_name(type_decl[1][1], type_name).lower()
   field = f"{key_field}_to_{value_field}"
```
It could be easily changed if another name schema is desired.
I modified the _field_type_name function so that it is able to detect lists of lists and return field names such as list_of_<type>.

With these changes the field names change for the BulletList, LineBlock, Meta, MetaMap, and OrderedList types. Below is the table of the resulting field names for these types:

| Type | Field names (experimental branch) | New field names |
|---|---|---|
| BulletList | `content` | `list_of_blocks` |
| LineBlock | `content` | `list_of_inlines` |
| Meta | `map` | `text_to_value` |
| MetaMap | `content` | `text_to_meta_value` |
| OrderedList | `list_attributes`, `content` | `list_attributes`, `list_of_blocks` |
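For anyone who wants to check the names in the table without installing the branch, here is a standalone sketch of the naming logic from the diff. It is a simplified re-implementation: `to_snake_case`, `to_plural`, `nested_name` and `field_name` are stand-ins written for this example, and the snake-case and plural rules are assumptions that may not match the library helpers in every case.

```python
import re


def to_snake_case(name):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_plural(name):
    return name + "s"  # naive rule, enough for this example


def nested_name(kind, type_decl, recursive=False, prefix=""):
    """Name for [kind, [T]]; with recursive=True also for [kind, [[kind, [T]]]]."""
    if not (
        isinstance(type_decl, list)
        and type_decl[0] == kind
        and isinstance(type_decl[1], list)
        and len(type_decl[1]) == 1
    ):
        return None
    inner = type_decl[1][0]
    if isinstance(inner, str):
        return to_snake_case(inner)
    if isinstance(inner, list) and recursive:
        name = nested_name(kind, inner, recursive, prefix)
        return prefix + name if name else None
    return None


def field_name(type_decl, type_name=""):
    if isinstance(type_decl, str):
        return to_snake_case(type_decl.removeprefix(type_name))
    if type_decl[0] == "map":
        key, value = type_decl[1]
        return f"{field_name(key, type_name)}_to_{field_name(value, type_name)}"
    if name := nested_name("maybe", type_decl):
        return name.removeprefix(type_name.lower()).lstrip("_")
    if name := nested_name("list", type_decl, recursive=True, prefix="list_of_"):
        return to_plural(name.removeprefix(type_name.lower()).lstrip("_"))
    return "content"  # fallback when no name can be derived


list_of_list_of_blocks = ["list", [["list", ["Block"]]]]
meta_map = ["map", ["Text", "MetaValue"]]

print(field_name(list_of_list_of_blocks, "BulletList"))
print(field_name(meta_map, "Meta"))
print(field_name(meta_map, "MetaMap"))
```

Note how the type-name prefix removal explains the difference between the Meta and MetaMap rows: with type name Meta, MetaValue is shortened to value, while with type name MetaMap nothing is stripped.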

If you like this I can do a commit with the changes.

boisgera · 2024-09-09T15:46:31Z

boisgera
Sep 9, 2024
Maintainer

Lists of lists

> I was thinking of list_of_blocks, but mainly because of your previous message, where you made this suggestion:
>
> > type ListOfBlocks = [[Block]] in order to replace the occurrence of [[Block]] everywhere (?).

I like it for every reason but one, and it's a big one: the name is probably misleading. A "list of blocks" is most likely interpreted as a "list of objects of type Block". Don't you think?

I tried to make ChatGPT guess the type behind several names, and one that works better than list_of_blocks is blocks_list.
(It is still ambiguous, but I can convince ChatGPT that it's a nested structure. Note that blocks_lists works even better for ChatGPT; it doesn't need help to guess the right type!)

At the moment, I'd favor blocks_list:

- linguistically less ambiguous
- simple implementation (after the first plural suffix, add "_list" for each new list nesting)
- attribute completion: you know that you have a container of blocks but don't remember the nesting or naming convention; you don't really care: you type block and let the IDE show you the related attributes (blocks for a doc, blocks_list for a bullet list, etc.)

(I had a look at the pandoc source code and, AFAICT, the variables of type [[Block]] are named after their local semantics: they are defs, notes, cells, headers (and maybe more), so that doesn't help us at all.)
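
As a rough illustration of the "_list" suffix rule and of the attribute-completion argument, here is a minimal sketch. The `nested_list_name` helper and the two stand-in dataclasses are hypothetical and only mimic the idea; they are not the library's actual types or code, and the plural is naively formed by appending "s".

```python
from dataclasses import dataclass, fields


def nested_list_name(element: str, depth: int) -> str:
    """Name a field holding `depth` levels of lists of `element`.

    depth=1 gives the plain plural; each extra nesting level appends "_list".
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return element + "s" + "_list" * (depth - 1)


# Hypothetical stand-ins for two document types (not the real classes).
@dataclass
class Doc:
    blocks: list        # [Block]


@dataclass
class BulletList:
    blocks_list: list   # [[Block]]


def complete(cls, prefix: str) -> list:
    """Mimic IDE completion: list the field names starting with `prefix`."""
    return [f.name for f in fields(cls) if f.name.startswith(prefix)]


print(nested_list_name("block", 1))
print(nested_list_name("block", 2))
print(complete(Doc, "block"))
print(complete(BulletList, "block"))
```

Because every name starts with the element name, typing `block` surfaces the right attribute whatever the nesting depth, which would not be the case with a `list_of_` prefix.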

Maps

> I personally like both the map and the text_to_meta_value options. I managed to obtain the text_to_meta_value and list_of_blocks naming schemes with a few changes to the code; this is the diff of the changes:

OK, let's pick text_to_meta_value then! (With an underlying scheme of f"{key_name}_to_{value_name}")

Commit

> If you like this, I can do a commit with the changes.

Please do, thanks a lot! 👍

If you're ok with it (we can take more time to discuss & brainstorm if needed), I'll change the list_of_blocks scheme to blocks_list afterwards.

3 replies

tombolano Sep 9, 2024
Collaborator Author

> I like it for every reason but one, and it's a big one: the name is probably misleading. A "list of blocks" is most likely interpreted as a "list of objects of type Block". Don't you think?

Ah yes, you are totally right, I actually hadn't thought of that 🤕

> I tried to make ChatGPT guess the type behind several names, and one that works better than list_of_blocks is blocks_list. (It is still ambiguous, but I can convince ChatGPT that it's a nested structure. Note that blocks_lists works even better for ChatGPT; it doesn't need help to guess the right type!)
>
> At the moment, I'd favor blocks_list:
>
> - linguistically less ambiguous
> - simple implementation (after the first plural suffix, add "_list" for each new list nesting)
> - attribute completion: you know that you have a container of blocks but don't remember the nesting or naming convention; you don't really care: you type block and let the IDE show you the related attributes (blocks for a doc, blocks_list for a bullet list, etc.)
>
> (I had a look at the pandoc source code and, AFAICT, the variables of type [[Block]] are named after their local semantics: they are defs, notes, cells, headers (and maybe more), so that doesn't help us at all.)

OK, I also think blocks_list is a better name.

> > If you like this, I can do a commit with the changes.
>
> Please do, thanks a lot! 👍
>
> If you're ok with it (we can take more time to discuss & brainstorm if needed), I'll change the list_of_blocks scheme to blocks_list afterwards.

OK, great, I will do the commit. Changing the list_of_blocks scheme to blocks_list is not a big deal; I will change it before making the commit.

tombolano Sep 9, 2024
Collaborator Author

I have already committed the changes. However, some functions, such as _get_field_name and _field_type_name, do not have docstrings; I'm happy to add those if you'd like.

boisgera Sep 10, 2024
Maintainer

Sure, add docstrings if that makes sense to you (and if you can spare the time).
