Commit 8dfd4ed

Fix indentation error and linelength warnings.
1 parent c82f90f commit 8dfd4ed

1 file changed: +64 -43 lines changed

classes/robot/crawler.php

+64-43
@@ -1000,60 +1000,81 @@ private static function determine_filesize($curlhandle, $method, $success, $body

 /* Implementation-specific notes, currently not part of the API:
 *
-* This function implements an HTTP client built on Curl. In the usual case, when everything runs smoothly, it uses keep-alive
-* connections when possible.(*) It issues HEAD requests in order to find out about the media type and length of the resource.
-* If the target resource is an HTML document, uses a GET request to retrieve it, and extracts and stores the document title. In
-* case of errors, these are recorded in the returned result object.
+* This function implements an HTTP client built on Curl. In the usual
+* case, when everything runs smoothly, it uses keep-alive connections when
+* possible.(*) It issues HEAD requests in order to find out about the
+* media type and length of the resource. If the target resource is an
+* HTML document, uses a GET request to retrieve it, and extracts and
+* stores the document title. In case of errors, these are recorded in the
+* returned result object.
 *
-* (*) XXX: future possible extension: reuse Curl handles across function calls so that we can reuse a handle for more than one
-* request. This will be beneficial when loading lots of resources from a single web server (in most cases, the own Moodle web
-* server) as initializing a TCP connection takes quite some time.
+* (*) XXX: future possible extension: reuse Curl handles across function
+* calls so that we can reuse a handle for more than one request. This will
+* be beneficial when loading lots of resources from a single web server
+* (in most cases, the own Moodle web server) as initializing a TCP
+* connection takes quite some time.
 *
-* The amount of transmitted data is marginally increased by the additional HEAD request and response(s). The time needed to
-* handle URIs may also increase slightly. As a result of using HEAD first, followed by a possible GET, the number of requests
-* to the server is often doubled. But the needed time is not, due to keep-alive connections, so this is neglegible. Big
-* resources are not downloaded at all or are not entirely downloaded. Main purpose of this is to avoid starting a download of a
-* non-HTML document of which the size is already known after HEAD processing. This is a common case on the web.
+* The amount of transmitted data is marginally increased by the additional
+* HEAD request and response(s). The time needed to handle URIs may also
+* increase slightly. As a result of using HEAD first, followed by a
+* possible GET, the number of requests to the server is often doubled. But
+* the needed time is not, due to keep-alive connections, so this is
+* neglegible. Big resources are not downloaded at all or are not entirely
+* downloaded. Main purpose of this is to avoid starting a download of a
+* non-HTML document of which the size is already known after HEAD
+* processing. This is a common case on the web.
 *
-* If the queried web server is not a general-purpose web server (see RFC 7231 section 4.1
-* <https://tools.ietf.org/html/rfc7231#section-4.1>), it possibly does not support HEAD, but only understands GET. The server
-* will signal this in the response with 405 Method Not Allowed. If this happens, this function switches to GET.
+* If the queried web server is not a general-purpose web server (see RFC
+* 7231 section 4.1 <https://tools.ietf.org/html/rfc7231#section-4.1>), it
+* possibly does not support HEAD, but only understands GET. The server
+* will signal this in the response with 405 Method Not Allowed. If this
+* happens, this function switches to GET.
 *
-* For security reasons, if the server does not tell about the resource media type, this function does _not_ employ content
-* sniffing to find out whether the referenced representation is an HTML document. Instead, it assumes the media type to be
-* "application/octet-stream" (which means that it ignores the content of the document). See RFC 7231 section 3.1.1.5
-* <https://tools.ietf.org/html/rfc7231#section-3.1.1.5>.
+* For security reasons, if the server does not tell about the resource
+* media type, this function does _not_ employ content sniffing to find out
+* whether the referenced representation is an HTML document. Instead, it
+* assumes the media type to be "application/octet-stream" (which means
+* that it ignores the content of the document). See RFC 7231 section
+* 3.1.1.5 <https://tools.ietf.org/html/rfc7231#section-3.1.1.5>.
 *
-* The download size is almost always limited: this function employs TOOL_CRAWLER_HEADER_LIMIT as size limit for each of the
-* HTTP headers (NB: not header-fields). External resources are usually not downloaded in full, but at most
-* TOOL_CRAWLER_DOWNLOAD_LIMIT octets are retrieved. This is normally enough by far to extract the title of external HTML
+* The download size is almost always limited: this function employs
+* TOOL_CRAWLER_HEADER_LIMIT as size limit for each of the HTTP headers
+* (NB: not header-fields). External resources are usually not downloaded
+* in full, but at most TOOL_CRAWLER_DOWNLOAD_LIMIT octets are retrieved.
+* This is normally enough by far to extract the title of external HTML
 * documents.
 *
-* When redirections are followed, the size of the HTTP bodies (e.g. documents informing about the redirection) is limited, too,
-* with TOOL_CRAWLER_REDIRECTION_DOWNLOAD_LIMIT as the maximum allowed size.
+* When redirections are followed, the size of the HTTP bodies (e.g.
+* documents informing about the redirection) is limited, too, with
+* TOOL_CRAWLER_REDIRECTION_DOWNLOAD_LIMIT as the maximum allowed size.
 *
-* There is normally no need to fully download non-HTML resources, even if their size cannot be determined from the headers. The
-* function will store fuzzy sizes as well because even incomplete information can be useful in reports. Sizes can either be
-* unknown; or be exact; or be inexact, but a lower bound (in case of aborted downloads).
+* There is normally no need to fully download non-HTML resources, even if
+* their size cannot be determined from the headers. The function will
+* store fuzzy sizes as well because even incomplete information can be
+* useful in reports. Sizes can either be unknown; or be exact; or be
+* inexact, but a lower bound (in case of aborted downloads).
 *
-* In most cases, it is sufficient for the average web out there and for average users of crawler reports to report external
-* non-HTML documents as having an unknown size if the web server has not provided any. In order to accommodate to other users’
-* wishes, this function allows to be configured: some details of how aggressive this function tries to determine resource
-* lengths and HTML document titles can be adjusted by the configuration settings of the plugin; see the API documentation
+* In most cases, it is sufficient for the average web out there and for
+* average users of crawler reports to report external non-HTML documents
+* as having an unknown size if the web server has not provided any. In
+* order to accommodate to other users’ wishes, this function allows to be
+* configured: some details of how aggressive this function tries to
+* determine resource lengths and HTML document titles can be adjusted by
+* the configuration settings of the plugin; see the API documentation
 * comments for TOOL_CRAWLER_NETWORKSTRAIN_*.
 *
-* While _external_ documents do not need to be fully retrieved, _HTML documents_ which are located _on the own Moodle web
-* server_ are always fully retrieved and parsed. This is necessary so that their links can be followed.
+* While _external_ documents do not need to be fully retrieved, _HTML
+* documents_ which are located _on the own Moodle web server_ are always
+* fully retrieved and parsed. This is necessary so that their links can be
+* followed.
 *
-* The code of this function has to consider at least the following things that can happen (possibly combined):
-* * curl_exec() signals an error,
-* * 405 Method Not Allowed in response to HEAD request,
-* * oversize header,
-* * oversize body in response to GET request,
-* * HTTP redirection,
-* * transfer is aborted by this function itself,
-* * resource is located on an _external_ host,
-* * redirection points to an external host, but the target resource is located on our web server again.
+* The code of this function has to consider at least the following things
+* that can happen (possibly combined): * curl_exec() signals an error, *
+* 405 Method Not Allowed in response to HEAD request, * oversize header, *
+* oversize body in response to GET request, * HTTP redirection, * transfer
+* is aborted by this function itself, * resource is located on an
+* _external_ host, * redirection points to an external host, but the
+* target resource is located on our web server again.
 */

 /**
@@ -1110,7 +1131,7 @@ public function scrape($url) {
 if ($config->networkstrain == TOOL_CRAWLER_NETWORKSTRAIN_REASONABLE) {
 $sizelimit = TOOL_CRAWLER_DOWNLOAD_LIMIT;
 } else if ($config->networkstrain == TOOL_CRAWLER_NETWORKSTRAIN_WASTEFUL) {
-// Always fully download if not aborted by other conditions (like: Content-Length known for non-HTML documents).
+// Always fully download if not aborted by other conditions (like: Content-Length known for non-HTML documents).
 $sizelimit = -1; // No size limit.
 } else {
 $sizelimit = $config->bigfilesize * 1000000;
