Support standard citation options #1325

Open
rviscomi opened this issue Sep 27, 2020 · 27 comments
Labels
design Creating the Almanac UX development Building the Almanac tech stack good first issue Good for newcomers SEO SEO related

@rviscomi
Member

rviscomi commented Sep 27, 2020

See https://scholar.google.com/intl/en/scholar/inclusion.html#indexing
I'd love to see Google Scholar automatically crawling and indexing our ebook content.
Edit - this was completed in #2191

We should also add citation options at the bottom of each chapter as discussed in #1325 (comment) and #1325 (comment)

@rviscomi rviscomi added development Building the Almanac tech stack SEO SEO related labels Sep 27, 2020
@rviscomi rviscomi added this to the 2020 Backlog milestone Sep 27, 2020
@nrllh
Collaborator

nrllh commented Sep 27, 2020

I think we should also configure each chapter that way, to get better impressions.

@tunetheweb
Member

@rviscomi looks like the limit for Google Scholar is 5MB:

Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.

Our ebook is 17MB, so it is not eligible.

I think we should also configure each chapter that way, to get better impressions.

We actually include most of this data as Structured Data in the chapters already (JavaScript example).

We could include the extra metadata too in an effort to get indexed there, but again most of the chapters are larger than the 5MB maximum. For example, the CSS chapter comes in at 52MB on a high-resolution screen once the interactive graphs have loaded! The non-interactive version with fallback images is 1.2MB, but I'm not sure whether the Scholar bot will crawl that.

@rviscomi
Member Author

My understanding is the 5MB limit applies to the PDF version but not the HTML version, which can be indexed and searchable in Scholar. There may be metadata that we can add to the HTML version of the ebook to make it more search friendly.

@tunetheweb
Member

Not convinced about that:

1. File formats

Your files need to be either in the HTML or in the PDF format. PDF files must have searchable text, i.e., you must be able to search for and find words in the document using Adobe Acrobat Reader.

Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.

Also, at the moment we explicitly stop Google from indexing the HTML ebook page, to stop it competing with the PDF version:

<meta name="robots" content="noindex">

Similarly only the PDF version is in our sitemap.

Would presumably need to change both of those as part of this if we did want to proceed.
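
To see which pages currently opt out of indexing, a small script could look for the robots meta tag quoted above. This is only a sketch using Python's standard `html.parser`; the `RobotsMetaParser` class, the `has_noindex` helper, and the sample markup are illustrative assumptions, not code from the Almanac repository.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # A robots tag may hold several comma-separated directives.
            self.directives.extend(
                part.strip().lower() for part in content.split(",")
            )


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample pages for illustration only.
ebook_page = '<html><head><meta name="robots" content="noindex"></head></html>'
chapter_page = "<html><head><title>Security</title></head></html>"

print(has_noindex(ebook_page))    # True
print(has_noindex(chapter_page))  # False
```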

@rviscomi
Member Author

There is some Scholar-friendly metadata we could add: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

For example:

<meta name="citation_title" content="The testis isoform of the phosphorylase kinase catalytic subunit (PhK-T) plays a critical role in regulation of glycogen mobilization in developing lung">
<meta name="citation_author" content="Liu, Li">
<meta name="citation_author" content="Rannels, Stephen R.">
<meta name="citation_author" content="Falconieri, Mary">
<meta name="citation_author" content="Phillips, Karen S.">
<meta name="citation_author" content="Wolpert, Ellen B.">
<meta name="citation_author" content="Weaver, Timothy E.">
<meta name="citation_publication_date" content="1996/05/17">
<meta name="citation_journal_title" content="Journal of Biological Chemistry">
<meta name="citation_volume" content="271">
<meta name="citation_issue" content="20">
<meta name="citation_firstpage" content="11761">
<meta name="citation_lastpage" content="11766">
<meta name="citation_pdf_url" content="http://www.example.com/content/271/20/11761.full.pdf">

Would be great to have our own content appearing in Scholar, in addition to the citations from other research papers!
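
As a rough illustration of how tags like these might be generated per chapter, here is a minimal Python sketch. The `scholar_meta_tags` function is hypothetical, the sample values are the Privacy chapter details quoted later in this thread, and `citation_fulltext_html_url` is the tag name used there for the HTML version.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, html_url):
    """Return Google Scholar style <meta> tags, one per line."""
    fields = [("citation_title", title)]
    # Scholar expects one citation_author tag per author.
    fields += [("citation_author", author) for author in authors]
    fields += [
        ("citation_publication_date", publication_date),
        ("citation_fulltext_html_url", html_url),
    ]
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in fields
    )


tags = scholar_meta_tags(
    title="The 2021 Web Almanac: Privacy",
    authors=["Yana Dimova", "Victor Le Pochat"],
    publication_date="2021/11/17",
    html_url="https://almanac.httparchive.org/en/2021/privacy",
)
print(tags)
```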

@rviscomi rviscomi added the good first issue Good for newcomers label Apr 28, 2021
@nrllh
Collaborator

nrllh commented Apr 30, 2021

@rviscomi, I think it would be worthwhile to provide a recommended citation for each chapter (as text and BibTeX). See an example here. Then people know how they should cite, and all references will be uniform. Otherwise, it will be hard for Google Scholar to assign references to a chapter if authors reference the chapters differently. Maybe we can use this as the title in our recommendation: The {year} Web Almanac: {chapter}.

@rviscomi rviscomi added the design Creating the Almanac UX label May 3, 2021
@rviscomi
Member Author

rviscomi commented May 3, 2021

Really like that idea @nrllh! Adding the design label to loop in @HTTPArchive/designers to think about how to expose the citation UX.

@shantsis
Contributor

shantsis commented May 4, 2021

Is the idea here to provide a standard MLA/LaTeX-style citation block that can be copied elsewhere?

@rviscomi
Member Author

rviscomi commented May 4, 2021

@nrllh has more publishing experience and can elaborate more on his idea, but yes I think that's exactly it. For example, if I search for almanac.httparchive.org on Google Scholar I get results like this, with a button to copy a citation:

image
image

@nrllh
Collaborator

nrllh commented May 5, 2021

@shantsis yes, that's our goal.

@shantsis
Contributor

Perhaps something like this above or below the author (with whichever formats we choose)
citation

@rviscomi
Member Author

@shantsis nice work, I like it!

Does anyone else have any feedback or suggestions? If not we can pass this to the dev team for implementation.

@nrllh
Collaborator

nrllh commented May 11, 2021

Here is my suggestion:

BibTeX - based on this template:

@TechReport{ {author1_lastname}.Almanac.{year},
author = "{author1_lastname, author1_firstname} { and author2_lastname, author2_firstname} { and author3_lastname, author3_firstname}",
title = "The {year} Web Almanac: {chapter}",
institution = "HTTPArchive",
year = "{year}",
note = "Available as \url{url}"
}

The output for the 2020 Security chapter would then be:

@techreport{VanGoethem.Almanac.2020,
  author      = "Van Goethem, Tom and Demir, Nurullah and Pollard, Barry",
  title       = "The 2020 Web Almanac: Security",
  institution = "HTTPArchive",
  year        = "2020",  
  note        = "Available as \url{https://almanac.httparchive.org/en/2020/security}"
}
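
The BibTeX template above could be filled in programmatically along these lines. This is a minimal sketch, not the implementation: the `bibtex_entry` function and its parameters are hypothetical, and it assumes authors are stored as (last name, first name) pairs.

```python
def bibtex_entry(authors, year, chapter, url):
    """Build a @techreport entry; authors is a list of (last, first) tuples."""
    # The citation key uses the first author's last name without spaces.
    key = f"{authors[0][0].replace(' ', '')}.Almanac.{year}"
    author_field = " and ".join(f"{last}, {first}" for last, first in authors)
    lines = [
        f"@techreport{{{key},",
        f'  author      = "{author_field}",',
        f'  title       = "The {year} Web Almanac: {chapter}",',
        '  institution = "HTTPArchive",',
        f'  year        = "{year}",',
        f'  note        = "Available as \\url{{{url}}}"',
        "}",
    ]
    return "\n".join(lines)


entry = bibtex_entry(
    authors=[("Van Goethem", "Tom"), ("Demir", "Nurullah"), ("Pollard", "Barry")],
    year=2020,
    chapter="Security",
    url="https://almanac.httparchive.org/en/2020/security",
)
print(entry)
```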

MLA - based on this template:

{author1_lastname}, {author1_firstname}. The {year} Web Almanac: {chapter}, HTTPArchive, {year}, {url}.

The output for the 2020 Security chapter would then be:

Van Goethem, Tom. The 2020 Web Almanac: Security, HTTPArchive, 2020, https://almanac.httparchive.org/en/2020/security.

IEEE - based on this template

{author1_firstname}[0]. {author1_lastname}, {author2_firstname}[0]. {author2_lastname}, The {year} Web Almanac: {chapter}, HTTPArchive, {year}. [Online]. {url}, Accessed on: {date_today}.

The output for the 2020 Security chapter would then be:

T. Van Goethem, N. Demir, B. Pollard, The 2020 Web Almanac: Security, HTTPArchive, 2020. [Online]. https://almanac.httparchive.org/en/2020/security, Accessed on: 11.05.2021.
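
The MLA and IEEE templates above are plain string patterns, so they could be rendered with small formatting helpers. The sketch below follows the templates as written; the function names are hypothetical, and authors are again assumed to be (last name, first name) pairs.

```python
from datetime import date


def mla_citation(authors, year, chapter, url):
    """MLA template from the thread: only the first author is listed."""
    last, first = authors[0]
    return f"{last}, {first}. The {year} Web Almanac: {chapter}, HTTPArchive, {year}, {url}."


def ieee_citation(authors, year, chapter, url, accessed):
    """IEEE template from the thread: first-name initial, then last name."""
    names = ", ".join(f"{first[0]}. {last}" for last, first in authors)
    accessed_on = accessed.strftime("%d.%m.%Y")
    return (
        f"{names}, The {year} Web Almanac: {chapter}, HTTPArchive, {year}. "
        f"[Online]. {url}, Accessed on: {accessed_on}."
    )


security_authors = [("Van Goethem", "Tom"), ("Demir", "Nurullah"), ("Pollard", "Barry")]
security_url = "https://almanac.httparchive.org/en/2020/security"

print(mla_citation(security_authors, 2020, "Security", security_url))
print(ieee_citation(security_authors, 2020, "Security", security_url, date(2021, 5, 11)))
```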

I think APA style is irrelevant for us (see here).

@tunetheweb
Member

Nice. Any thoughts on where it should go?

After the Conclusion, before Explore the results?
Right at the bottom, just before the footer?

@nrllh
Collaborator

nrllh commented May 11, 2021

After the Conclusion, before Explore the results?

I think this is a good place

@shantsis
Contributor

Yup or right below that and above the author. Either works :)

@tunetheweb tunetheweb changed the title from "Optimize ebook for indexing" to "Support standard citation options" May 12, 2021
@tunetheweb
Member

OK seems like we have the agreed approach and the design. So I've changed the title of the issue and updated the first comment.

@HTTPArchive/developers anyone want to take this one?

@VictorLeP
Contributor

  • Did Google Scholar ever pick up the meta data from Add Google Scholar meta data #2191? I don't see anything when I search "Web Almanac", nor any entries that seem to come from that meta data (for the few academic authors that I know to have a profile).
  • I'm planning to have the chapter that I co-wrote listed in KU Leuven's institutional repository, such that I can have it count as "scientific output". I'll reach out to the maintainers of that repository to get some details on how they would handle this (is it a journal? a book? a technical report? an online publication? what about copyright? do they want page numbers? a separate PDF for the chapter?), and would be happy to report back on what they tell me.
    I'll probably only do it after the release on 1 December. Question on that: can the chapter numbers / e-book pages still change after 1 December (for the few chapters where it is unclear whether they'll still be published - Temporarily remove some chapters from the ToC #2512)?
  • mini thing: we would need authors to provide their first and last name separately for the citations to work properly - automatic splitters tend to fail on my name for example :)
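
To illustrate the last point, here is a small Python sketch of why first and last names should be stored separately. The `Author` class and `naive_split` helper are hypothetical, and the example assumes "Le Pochat" is the family name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    first_name: str
    last_name: str

    def citation_name(self) -> str:
        """Format as 'Last, First', the form BibTeX author fields use."""
        return f"{self.last_name}, {self.first_name}"


def naive_split(full_name: str) -> Author:
    """Treats the final word as the family name; wrong for multi-word surnames."""
    first, _, last = full_name.rpartition(" ")
    return Author(first, last)


stored = Author("Victor", "Le Pochat")     # names provided separately
guessed = naive_split("Victor Le Pochat")  # names guessed from one string

print(stored.citation_name())   # Le Pochat, Victor
print(guessed.citation_name())  # Pochat, Victor Le
```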

@tunetheweb
Member

Some of them have started to show in skeleton form - but not sure if that's because of #2191 or because they were already being cited (interestingly one shows in a translated form - which suggests it's probably the latter):

image

They do say:

Keep in mind that changes that you make on your website will usually not be reflected in Google Scholar search results for some time. New papers are normally added several times a week; however, updates of papers that are already included usually take 6-9 months. Updates of papers on very large websites may take several years, because to update a site, we need to recrawl it - the time it takes to recrawl a large site is usually limited by the speed at which the target website is able to deliver content to the search robots.

Will be interesting to see if the 2021 chapters are indexed quicker since they have these. Or maybe they're just not deemed a good fit for Google Scholar (despite being cited by several of the other papers). 🤷 Either way, I still think it would be good to have the human-readable citation options at the bottom of the page, as we have been asked once or twice about how to cite this officially.

@VictorLeP
Contributor

Some of them have started to show in skeleton form - but not sure if that's because of #2191 or because they were already being cited (interestingly one shows in a translated form - which suggests it's probably the latter):

Pretty sure it's the latter: the metadata has the title "The 2019 Web Almanac: JavaScript", not what is shown in the image. I'm pretty sure Google Scholar creates [CITATION] entries for resources that are referenced in academic works but that it fails to match to any known item (also for newspaper articles, for example).

Will be interesting to see if the 2021 chapters are indexed quicker since they have these.

Indeed! Though having just looked at the Privacy chapter, the publication date is weird: 2021/05/02?

Or maybe they're just not deemed a good fit for Google Scholar (despite being cited by several of the other papers).

That Google Scholar guide also mentions:

these fields must contain sufficient information to identify a reference to this paper from another document

So it might fail because it does not think there is enough information...

Either way, I still think it would be good to have the human-readable citation options at the bottom of the page, as we have been asked once or twice about how to cite this officially.

Certainly, we might cite it as well at some point :)

@VictorLeP
Contributor

Update from the university: since this is targeted towards a non-academic audience, they think "Scientific outreach" is the best category, so they don't consider it a book or journal. (maybe they'd be happier if we had an ISBN/ISSN)

Google Scholar also doesn't appear to have picked up the 2021 chapters (yet).

@nrllh
Collaborator

nrllh commented Dec 6, 2021

@VictorLeP I think Google Scholar will not index it either; if we get a DOI or ISBN, that would be perfect. Our 2020 version was published on Google Books[1] (and the Play Store), but it still doesn't appear in Scholar.

[1] https://www.google.de/books/edition/The_2020_Web_Almanac/wqcPEAAAQBAJ?hl=de&gbpv=0

@VictorLeP
Contributor

VictorLeP commented Dec 6, 2021

An ISBN seems to cost $125 (in the US); a DOI can be derived from an ISBN.

There seem to be a number of ways to get only a DOI, possibly for free. It seems you usually do have to upload some file. One provider missing in those posts is OSF, which provides DOIs and has an option to "soft redirect" to a link (that is, you get a pop-up).

It might actually be nice if we could get one DOI per chapter instead of one for the Almanac as a whole.

@tunetheweb
Member

Also need to remember the translations. So at $125 per language, per chapter, per year that could add up! Though you can often buy them in bulk much cheaper. We discussed getting ISBNs here: #1219

I'm not sure we need to get into Google Scholar. It's a nice-to-have since we are cited in so many articles in there already, and it's potentially another way of making the content available to those who might not otherwise find it. But other than that I'm not desperate to invest in an ISBN or DOI just to get cited in there.

However I do think it would be good to tell people how to cite our articles with the above suggested addition to our web pages, since we are cited a lot and we have been asked the question before.

@VictorLeP
Contributor

VictorLeP commented Dec 6, 2021

I don't think you need to/can get an ISBN per chapter, but it would still be X years times Y languages so it could indeed get expensive fast.
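
As a back-of-the-envelope check on "X years times Y languages", the arithmetic can be sketched as below. Only the $125 single-ISBN price comes from this thread; the language, edition, and chapter counts are made-up inputs for illustration, and bulk discounts are ignored.

```python
ISBN_PRICE_USD = 125  # single-ISBN US price quoted earlier in the thread


def isbn_cost(languages, years, chapters=1, price=ISBN_PRICE_USD):
    """Total cost when every (language, year, chapter) combination needs an ISBN."""
    return price * languages * years * chapters


# Hypothetical inputs: 5 languages, 3 yearly editions, 20 chapters per edition.
per_book = isbn_cost(languages=5, years=3)
per_chapter = isbn_cost(languages=5, years=3, chapters=20)

print(per_book)     # 1875
print(per_chapter)  # 37500
```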

A standard way to cite might actually be sufficient. As I mentioned, Google Scholar picks up on these citations, so it might be an indirect way to get indexed there. I think that my submission of the chapter metadata to the KU Leuven repository will also trigger a Google Scholar entry (albeit only for the Privacy chapter).

In terms of the citation itself, I don't really see why we couldn't go for an actual (book) chapter, for example with this (BibLaTeX!) template:

@inbook{ WebAlmanac.{year}.{chapter_number},
  author = "{author1_lastname, author1_firstname} 
       { and author2_lastname, author2_firstname} 
       { and author3_lastname, author3_firstname}",
  title = "{chapter}",
  booktitle = "{year} Web Almanac",
  chapter = {chapter_number},
  pages = "{ebook_pages}",
  publisher = "HTTP Archive",
  year = "{year}",
  url = "{url}"
}

@tunetheweb
Member

BTW we have this meta data in the chapters already:

    <meta name="citation_title" content="The 2021 Web Almanac: Privacy">
    <meta name="citation_author" content="Yana Dimova">
    <meta name="citation_author" content="Victor Le Pochat">
    <meta name="citation_publication_date" content="2021/11/17">
    <meta name="citation_journal_title" content="The 2021 Web Almanac">
    <meta name="citation_volume" content="3">
    <meta name="citation_issue" content="11">
    <meta name="citation_publisher" content="HTTP Archive">
    <meta name="citation_technical_report_institution" content="HTTP Archive">
    <meta name="citation_language" content="English">
    <meta name="citation_fulltext_html_url" content="https://almanac.httparchive.org/en/2021/privacy">

This was added in May this year.

And we've had this JSON-LD meta data in there too since the original 2019 launch:

    {
      "@context": "http://schema.org",
      "@type": "Article",
      "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://almanac.httparchive.org/en/2021/privacy"
      },
      "headline": "Privacy | 2021 | The Web Almanac by HTTP Archive",
      "image": {
        "@type": "ImageObject",
        "url": "https://almanac.httparchive.org/static/images/2020/privacy/hero_lg.jpg",
        "height": 433,
        "width": 866
      },
      "publisher": {
        "@type": "Organization",
        "name": "HTTP Archive",
        "logo": {
          "@type": "ImageObject",
          "url": "https://almanac.httparchive.org/static/images/ha.png",
          "height": 160,
          "width": 320
        },
        "sameAs": [
          "https://httparchive.org",
          "https://twitter.com/HTTPArchive",
          "https://github.com/HTTPArchive"
        ]
      },
      "author": [
        {
          "@type": "Person",
          "sameAs": [
            "https://almanac.httparchive.org/en/2021/contributors#ydimova",
            "https://github.com/ydimova"
          ],
          "name": "Yana Dimova"
        },
        {
          "@type": "Person",
          "sameAs": [
            "https://almanac.httparchive.org/en/2021/contributors#victorlep",
            "https://twitter.com/VictorLePochat",
            "https://github.com/VictorLeP",
            "https://www.linkedin.com/in/victor-le-pochat/"
          ],
          "name": "Victor Le Pochat"
        }
      ],
      "description": "Privacy chapter of the 2021 Web Almanac covering adoption and impact of online tracking, privacy preference signals and browser initiatives for a privacy-friendlier web.",
      "datePublished": "2021-11-17T00:00:00.000Z",
      "dateModified": "2021-12-04T00:00:00.000Z"
    }
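
Since the JSON-LD already carries the authors, URL and publication date, a citation box could in principle read its fields from there. The sketch below is an assumption about how that might look, not how the site does it: `citation_fields` is a hypothetical helper, and the JSON is a trimmed copy of the block above.

```python
import json
from datetime import datetime

# Trimmed copy of the Privacy chapter JSON-LD shown above.
json_ld = """
{
  "@type": "Article",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://almanac.httparchive.org/en/2021/privacy"
  },
  "author": [
    {"@type": "Person", "name": "Yana Dimova"},
    {"@type": "Person", "name": "Victor Le Pochat"}
  ],
  "datePublished": "2021-11-17T00:00:00.000Z"
}
"""


def citation_fields(raw: str) -> dict:
    """Pull the values a citation needs out of a JSON-LD Article."""
    data = json.loads(raw)
    published = datetime.strptime(data["datePublished"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "authors": [person["name"] for person in data["author"]],
        # Same YYYY/MM/DD form as the citation_publication_date meta tag.
        "publication_date": published.strftime("%Y/%m/%d"),
        "url": data["mainEntityOfPage"]["@id"],
    }


fields = citation_fields(json_ld)
print(fields)
```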

@thibaudcolas
Member

thibaudcolas commented Oct 7, 2022

If all we want is a DOI, I was recommended Zenodo. It’s a CERN project, completely free, allows 50GB per upload. Takes about 2min to upload one PDF with minimal metadata, longer if we fill in a lot of details.

Here is an upload I made of three pages from this year’s accessibility chapter, on their sandbox server: https://sandbox.zenodo.org/record/1112032. Those three pages got a pretend DOI of 10.5072/zenodo.1112032.
As far as I understand, even on the real server, the DOI is generated as soon as you hit "publish" and confirm.

nrllh added a commit that referenced this issue Aug 9, 2024
Implemented a citation box in BibTeX format as per the discussion in issue #1325. Details: #1325
tunetheweb added a commit that referenced this issue Nov 10, 2024
* Add BibTeX citation box (#1325)

Implemented a citation box in BibTeX format as per the discussion in issue #1325. Details: #1325

* Update page.css

Fixing fff issue from linter.

* Linting errors

* Add DOI

* Formatting

* Internationalisation

---------

Co-authored-by: Mike Gifford <[email protected]>
Co-authored-by: Barry Pollard <[email protected]>
6 participants