404s within LLVM documentation

Hi all,

I’m currently in the process of updating the Kaleidoscope tutorials (first and foremost, the ORC/BuildingAJIT ones), and I’ve noticed a fair few 404s lingering within the currently visible documentation. Some of these don’t appear to have pointed at existing pages for quite some time.

I was wondering if there is a way to set up a check in the buildbots to ensure that documentation links don’t break between builds? I’m happy to fix the dead links I’ve found so far (see below), but it seems wise to set up a more automated approach for the future. Does anyone have tips on how I’d go about doing this, or on whether it should be set up at all?

I ran a web crawler to find the dead links (the list below may not be exhaustive); they are as follows:

https://llvm.org/docs/TestSuiteMakefileGuide
https://llvm.org/docs/doxygen/structLICM.html
https://llvm.org/docs/tutorial/LangImpl5.html#for-loop-expression
https://llvm.org/docs/tutorial/LangImpl7.html#user-defined-local-variables
http://llvm.org/docs/lnt/modindex.html
https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl6.html#user-defined-unary-operators
https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl5.html#for-loop-expression
https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl7.html#user-defined-local-variables
https://llvm.org/docs/tutorial/LangRef.html#instruction-reference
https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl4.html#adding-a-jit-compiler
https://llvm.org/docs/tutorial/WritingAnLLVMPass.html
https://llvm.org/docs/tutorial/Passes.html
https://llvm.org/docs/tutorial/ProgrammersManual.html#viewing-graphs-while-debugging-code
https://llvm.org/docs/tutorial/SourceLevelDebugging.html
https://llvm.org/docs/tutorial/Frontend/PerformanceTips.html
https://llvm.org/docs/tutorial/GetElementPtr.html
https://llvm.org/docs/tutorial/GarbageCollection.html
https://llvm.org/docs/tutorial/ExceptionHandling.html
https://www.llvm.org/docs/doxygen/structLICM.html
http://llvm.org/docs/TestSuiteMakefileGuide
http://llvm.org/docs/doxygen/structLICM.html
https://www.llvm.org/docs/TestSuiteMakefileGuide
http://llvm.org/docs/tutorial/LangImpl5.html#for-loop-expression
http://llvm.org/docs/tutorial/LangImpl7.html#user-defined-local-variables

Some of these are trivial mistakes (e.g. https://llvm.org/docs/tutorial/LangRef.html#instruction-reference should simply be https://llvm.org/docs/LangRef.html#instruction-reference), and some require a bit more inspection.

Regards,
Patrick

I don’t know the best way to do this, but big +1 - it would be great if there could somehow be tests for broken documentation links.

I don’t know this for sure, but I believe there aren’t currently even bots testing whether the doxygen and Sphinx docs build successfully.

-Alex

Patrick, you have identified a good way to do this. It is likely that the links point to files in a directory structure on a single server, with the file structure/path given by the link text (as we see in your dead-link list), and that in a good number of cases - perhaps a large majority - the file names (less the directory path) are unique.

It would be a fairly direct procedure to associate links, by their file name (less path), with actual file locations. The process would then rewrite links with the correct paths, list links whose file does not exist, and list dead links whose file name matches more than one existing file.

How frequently that process is run would depend on how frequently it discovers new dead links.
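As a rough sketch of that name-to-location index (assuming a local copy of the generated HTML tree; the file names here are illustrative):

# index every html file by its basename, producing a sorted basename/path table
find . \( -name '*.html' -o -name '*.htm' \) | awk -F/ '{print $NF "\t" $0}' | sort > name_to_path.tsv

# basenames that map to more than one path are ambiguous and need manual
# resolution; the rest can be rewritten mechanically
cut -f1 name_to_path.tsv | uniq -d > ambiguous_names.txt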

Regards, Neil Nelson

Patrick, how long does the crawl take? I suspect if we fixed internal documentation links so that they point to local copies of documentation when building locally it would be quite quick (no actual idea though). That in turn would probably make it feasible to add to the existing documentation build bots, I think.

James

> It would be a fairly direct procedure to associate links, by their file name (less path), with actual file locations. The process would then rewrite links with the correct paths, list links whose file does not exist, and list dead links whose file name matches more than one existing file.

and

> Patrick, how long does the crawl take? I suspect if we fixed internal documentation links so that they point to local copies of documentation when building locally it would be quite quick (no actual idea though).

That crawl was actually done on the live site, using the linkchecker tool.
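For the curious, the invocation was along these lines (a sketch - I won’t swear to the exact options):

# crawl the live docs recursively, verifying external links as well
linkchecker --check-extern https://llvm.org/docs/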

Doing it locally would indeed be much better, and it turns out Sphinx has a built-in tool for such a check (cd llvm/docs && make -f Makefile.sphinx linkcheck), though it also checks that external hyperlinks are reachable. The runtime can be seriously reduced if we change all internal document links to actually be internal (i.e. link to /docs/foo/bar rather than https://llvm.org/docs/foo/bar or llvm.org/docs/foo/bar - easily fixable), so as to avoid a network round trip for each one. I do believe we should still check external links, as having documentation link to nowhere can be jarring; however, I don’t think such crawls need to be as frequent.
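For anyone who wants to try it without the Makefile wrapper, the linkcheck builder can also be invoked directly; a minimal sketch (the build directory path is my assumption - Makefile.sphinx may use a different one):

cd llvm/docs
# run only the link checker; broken and redirected links are summarised
# in _build/linkcheck/output.txt
sphinx-build -b linkcheck . _build/linkcheck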

Cheers,
Patrick

A practical way to proceed may be to have LLVM provide an HTML file list from their server, by going to the top-level directory and executing the following command:

find . \( -name '*.html' -o -name '*.htm' \) > llvm.org_html_file_list

giving all file names, with their parent directories, for the html and htm extensions. There may be multiple top-level directories of interest, which could also be put into their own file lists, though this is secondary at the moment. Having the name of the relevant top-level directory in each case would help; the top-level web-page name could also work. We just need to be sure the changes get back to the proper directory. Tar or zip the list(s) for easy download.

The LLVM HTML files could then be downloaded from that list to a local user’s computer using wget, the analysis done, and the changes made. The changes could then be uploaded back using diff files as patches, or as LLVM directs.
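A sketch of that download step (the base URL and intermediate file name here are illustrative):

# turn the relative paths from the find listing into absolute URLs
sed 's|^\./|https://llvm.org/|' llvm.org_html_file_list > llvm_urls.txt

# fetch them, recreating the directory structure locally (-x)
wget -x -i llvm_urls.txt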

Without such file lists from LLVM, the only local option would be to remove the HTML link tags for the dead links, which loses the easy ability to correct those links where corrections are possible. That procedure could be done by downloading the LLVM site’s HTML pages through page links with wget. Since possibly useful information is lost along the way, it is not likely the preferred option.

The first option, without knowing the parent pages of the dead links, would tend to require downloading all or most of the HTML files in the list in order to find the few of concern. Whether there are copyright or other issues with downloading large chunks of the LLVM site may also need to be considered.

There is an option in wget, when downloading a site, to change all the links to point at local files, in the manner Patrick suggests, which may achieve that objective. Considering the scale of that change, it would best be done on the LLVM server as a copy-with-changes using wget, with a browser then pointed at the copy to check the result before going live. It may be that wget alone would not suffice and that further link changes via a program would be required. It would be easy to redirect back to the prior LLVM site if critical problems were found later. But the scale of this change suggests it be given more detailed consideration at LLVM, as against the relatively few dead-link changes identified so far, which could be addressed with diff uploads.
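Presumably the wget option in question is --convert-links; a minimal sketch of such a mirror (other flags would likely be needed in practice):

# mirror the docs tree, staying below /docs/ (--no-parent) and rewriting
# links in the downloaded pages to point at the local copies
wget --mirror --convert-links --adjust-extension --no-parent https://llvm.org/docs/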

The option of writing a program for the dead-link analysis and changes seems less likely, in that the programmer would need to write for an environment not immediately available to them, and a program would not offer the more incremental and clearly visible review that diff uploads provide.

Regards, Neil Nelson