@@ -1000,60 +1000,81 @@ private static function determine_filesize($curlhandle, $method, $success, $body
/* Implementation-specific notes, currently not part of the API:
 *
- * This function implements an HTTP client built on Curl. In the usual case, when everything runs smoothly, it uses keep-alive
- * connections when possible.(*) It issues HEAD requests in order to find out about the media type and length of the resource.
- * If the target resource is an HTML document, uses a GET request to retrieve it, and extracts and stores the document title. In
- * case of errors, these are recorded in the returned result object.
+ * This function implements an HTTP client built on Curl. In the usual
+ * case, when everything runs smoothly, it uses keep-alive connections when
+ * possible.(*) It issues HEAD requests in order to find out about the
+ * media type and length of the resource. If the target resource is an
+ * HTML document, it uses a GET request to retrieve it, and extracts and
+ * stores the document title. In case of errors, these are recorded in the
+ * returned result object.
*
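To make the described behaviour concrete, here is a minimal standalone sketch of the HEAD-then-GET flow using PHP's Curl extension. It is not the plugin's actual code; the URL and the title regex are illustrative placeholders.

```php
<?php
// Sketch of the HEAD-then-GET strategy; not the plugin's actual code.
$url = 'https://example.com/some/page'; // Placeholder URL.

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_NOBODY, true);         // Issue a HEAD request.
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_exec($handle);

$type = curl_getinfo($handle, CURLINFO_CONTENT_TYPE);              // E.g. "text/html; charset=utf-8".
$length = curl_getinfo($handle, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if unknown.

// Only HTML documents are fetched in full here: their title is wanted.
if ($type !== null && strpos($type, 'text/html') === 0) {
    curl_setopt($handle, CURLOPT_NOBODY, false);
    curl_setopt($handle, CURLOPT_HTTPGET, true);    // Switch back to GET.
    $body = curl_exec($handle);                     // Reuses the keep-alive connection.
    if (is_string($body) && preg_match('~<title[^>]*>(.*?)</title>~is', $body, $matches)) {
        $title = trim($matches[1]);
    }
}
curl_close($handle);
```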
- * (*) XXX: future possible extension: reuse Curl handles across function calls so that we can reuse a handle for more than one
- * request. This will be beneficial when loading lots of resources from a single web server (in most cases, the own Moodle web
- * server) as initializing a TCP connection takes quite some time.
+ * (*) XXX: future possible extension: reuse Curl handles across function
+ * calls so that we can reuse a handle for more than one request. This will
+ * be beneficial when loading lots of resources from a single web server
+ * (in most cases, the own Moodle web server) as initializing a TCP
+ * connection takes quite some time.
*
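The handle reuse hinted at in that note could look roughly like the following sketch. This is an assumption about a possible future design, not existing plugin code; Curl transparently reuses the open TCP connection when the same handle requests another URL on the same host.

```php
<?php
// Sketch of reusing one Curl handle for several requests. When the target
// host stays the same, Curl can reuse the established TCP connection
// instead of opening a new one per request.
$handle = curl_init();
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

$urls = [
    'https://example.com/course/view.php?id=1',   // Placeholder URLs.
    'https://example.com/course/view.php?id=2',
];
foreach ($urls as $url) {
    curl_setopt($handle, CURLOPT_URL, $url);
    $body = curl_exec($handle);
    // ... process $body ...
}
curl_close($handle);
```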
- * The amount of transmitted data is marginally increased by the additional HEAD request and response(s). The time needed to
- * handle URIs may also increase slightly. As a result of using HEAD first, followed by a possible GET, the number of requests
- * to the server is often doubled. But the needed time is not, due to keep-alive connections, so this is neglegible. Big
- * resources are not downloaded at all or are not entirely downloaded. Main purpose of this is to avoid starting a download of a
- * non-HTML document of which the size is already known after HEAD processing. This is a common case on the web.
+ * The amount of transmitted data is marginally increased by the additional
+ * HEAD request and response(s). The time needed to handle URIs may also
+ * increase slightly. As a result of using HEAD first, followed by a
+ * possible GET, the number of requests to the server is often doubled. But
+ * the needed time is not, due to keep-alive connections, so this is
+ * negligible. Big resources are not downloaded at all or are not entirely
+ * downloaded. The main purpose of this is to avoid starting a download of
+ * a non-HTML document whose size is already known after HEAD processing.
+ * This is a common case on the web.
*
- * If the queried web server is not a general-purpose web server (see RFC 7231 section 4.1
- * <https://tools.ietf.org/html/rfc7231#section-4.1>), it possibly does not support HEAD, but only understands GET. The server
- * will signal this in the response with 405 Method Not Allowed. If this happens, this function switches to GET.
+ * If the queried web server is not a general-purpose web server (see RFC
+ * 7231 section 4.1 <https://tools.ietf.org/html/rfc7231#section-4.1>), it
+ * possibly does not support HEAD, but only understands GET. The server
+ * will signal this in the response with 405 Method Not Allowed. If this
+ * happens, this function switches to GET.
*
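A small sketch of the 405 fallback (illustrative, with a placeholder URL): when HEAD is answered with 405 Method Not Allowed, the same handle repeats the request as GET.

```php
<?php
// Illustrative 405 fallback: retry with GET when HEAD is not allowed.
$handle = curl_init('https://example.com/script.cgi'); // Placeholder URL.
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_NOBODY, true);            // Try HEAD first.
curl_exec($handle);

if (curl_getinfo($handle, CURLINFO_HTTP_CODE) == 405) {
    curl_setopt($handle, CURLOPT_NOBODY, false);
    curl_setopt($handle, CURLOPT_HTTPGET, true);       // Server only understands GET.
    $body = curl_exec($handle);
}
curl_close($handle);
```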
- * For security reasons, if the server does not tell about the resource media type, this function does _not_ employ content
- * sniffing to find out whether the referenced representation is an HTML document. Instead, it assumes the media type to be
- * "application/octet-stream" (which means that it ignores the content of the document). See RFC 7231 section 3.1.1.5
- * <https://tools.ietf.org/html/rfc7231#section-3.1.1.5>.
+ * For security reasons, if the server does not state the resource media
+ * type, this function does _not_ employ content sniffing to find out
+ * whether the referenced representation is an HTML document. Instead, it
+ * assumes the media type to be "application/octet-stream" (which means
+ * that it ignores the content of the document). See RFC 7231 section
+ * 3.1.1.5 <https://tools.ietf.org/html/rfc7231#section-3.1.1.5>.
*
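The no-sniffing rule can be illustrated as follows. This is a sketch with an invented helper name, not the plugin's actual parsing: only the Content-Type header is consulted, and a missing header falls back to application/octet-stream.

```php
<?php
// Decide the media type from the Content-Type header alone; never sniff
// the body. A missing header falls back to "application/octet-stream",
// following RFC 7231 section 3.1.1.5.
function media_type(?string $contenttype): string {
    if ($contenttype === null || $contenttype === '') {
        return 'application/octet-stream';
    }
    // Strip parameters such as "; charset=utf-8" and normalize case.
    return strtolower(trim(explode(';', $contenttype)[0]));
}

var_dump(media_type('text/HTML; charset=utf-8')); // "text/html"
var_dump(media_type(null));                       // "application/octet-stream"
```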
- * The download size is almost always limited: this function employs TOOL_CRAWLER_HEADER_LIMIT as size limit for each of the
- * HTTP headers (NB: not header-fields). External resources are usually not downloaded in full, but at most
- * TOOL_CRAWLER_DOWNLOAD_LIMIT octets are retrieved. This is normally enough by far to extract the title of external HTML
+ * The download size is almost always limited: this function employs
+ * TOOL_CRAWLER_HEADER_LIMIT as size limit for each of the HTTP headers
+ * (NB: not header-fields). External resources are usually not downloaded
+ * in full, but at most TOOL_CRAWLER_DOWNLOAD_LIMIT octets are retrieved.
+ * This is normally enough by far to extract the title of external HTML
 * documents.
*
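A common Curl technique for enforcing such a cap — shown here as a hedged sketch, not necessarily how tool_crawler implements it — is a write callback that refuses data once the limit is exceeded, which makes Curl abort the transfer.

```php
<?php
// Abort a transfer once more than $sizelimit octets have arrived.
// Returning a byte count different from the chunk length from the write
// callback makes Curl abort with CURLE_WRITE_ERROR.
$sizelimit = 100000; // Illustrative; stands in for TOOL_CRAWLER_DOWNLOAD_LIMIT.
$received = 0;
$buffer = '';

$handle = curl_init('https://example.com/big.pdf'); // Placeholder URL.
curl_setopt($handle, CURLOPT_WRITEFUNCTION,
    function ($ch, $chunk) use (&$received, &$buffer, $sizelimit) {
        $received += strlen($chunk);
        if ($received > $sizelimit) {
            return 0; // Refuse the data: Curl aborts the download.
        }
        $buffer .= $chunk;
        return strlen($chunk);
    });
curl_exec($handle);
$aborted = (curl_errno($handle) == CURLE_WRITE_ERROR); // We cut it off ourselves.
curl_close($handle);
```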
- * When redirections are followed, the size of the HTTP bodies (e.g. documents informing about the redirection) is limited, too,
- * with TOOL_CRAWLER_REDIRECTION_DOWNLOAD_LIMIT as the maximum allowed size.
+ * When redirections are followed, the size of the HTTP bodies (e.g.
+ * documents informing about the redirection) is limited, too, with
+ * TOOL_CRAWLER_REDIRECTION_DOWNLOAD_LIMIT as the maximum allowed size.
*
- * There is normally no need to fully download non-HTML resources, even if their size cannot be determined from the headers. The
- * function will store fuzzy sizes as well because even incomplete information can be useful in reports. Sizes can either be
- * unknown; or be exact; or be inexact, but a lower bound (in case of aborted downloads).
+ * There is normally no need to fully download non-HTML resources, even if
+ * their size cannot be determined from the headers. The function will
+ * store fuzzy sizes as well because even incomplete information can be
+ * useful in reports. Sizes can either be unknown; or be exact; or be
+ * inexact, but a lower bound (in case of aborted downloads).
*
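The three size states might be modelled along these lines (a sketch with invented names and structure; the plugin's actual result fields may differ):

```php
<?php
// Sketch of the three possible size states after a (possibly aborted)
// download. Field names are invented for illustration.
function classify_size(?int $contentlength, int $received, bool $aborted): array {
    if ($contentlength !== null && $contentlength >= 0) {
        return ['filesize' => $contentlength, 'exact' => true];  // Server told us.
    }
    if ($aborted) {
        return ['filesize' => $received, 'exact' => false];      // Lower bound only.
    }
    if ($received > 0) {
        return ['filesize' => $received, 'exact' => true];       // Fully downloaded.
    }
    return ['filesize' => null, 'exact' => false];               // Unknown.
}
```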
- * In most cases, it is sufficient for the average web out there and for average users of crawler reports to report external
- * non-HTML documents as having an unknown size if the web server has not provided any. In order to accommodate to other users’
- * wishes, this function allows to be configured: some details of how aggressive this function tries to determine resource
- * lengths and HTML document titles can be adjusted by the configuration settings of the plugin; see the API documentation
+ * In most cases, it is sufficient for the average web out there and for
+ * average users of crawler reports to report external non-HTML documents
+ * as having an unknown size if the web server has not provided any. In
+ * order to accommodate other users’ wishes, this function can be
+ * configured: some details of how aggressively this function tries to
+ * determine resource lengths and HTML document titles can be adjusted by
+ * the configuration settings of the plugin; see the API documentation
 * comments for TOOL_CRAWLER_NETWORKSTRAIN_*.
*
- * While _external_ documents do not need to be fully retrieved, _HTML documents_ which are located _on the own Moodle web
- * server_ are always fully retrieved and parsed. This is necessary so that their links can be followed.
+ * While _external_ documents do not need to be fully retrieved, _HTML
+ * documents_ which are located _on the own Moodle web server_ are always
+ * fully retrieved and parsed. This is necessary so that their links can be
+ * followed.
*
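Extracting the links of a fully retrieved internal document could be sketched with PHP's DOM extension (illustrative only; the crawler's real parser and URL handling may differ):

```php
<?php
// Sketch: collect all href targets from a fully downloaded HTML document
// so that the crawler can follow them. $body is the retrieved HTML.
$body = '<html><head><title>Course</title></head>'
      . '<body><a href="https://example.com/a">A</a></body></html>';

libxml_use_internal_errors(true);  // Tolerate real-world tag soup quietly.
$document = new DOMDocument();
$document->loadHTML($body);

$links = [];
foreach ($document->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if ($href !== '') {
        $links[] = $href;          // Queue for crawling; resolve relative URLs first.
    }
}
```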
- * The code of this function has to consider at least the following things that can happen (possibly combined):
- * * curl_exec() signals an error,
- * * 405 Method Not Allowed in response to HEAD request,
- * * oversize header,
- * * oversize body in response to GET request,
- * * HTTP redirection,
- * * transfer is aborted by this function itself,
- * * resource is located on an _external_ host,
- * * redirection points to an external host, but the target resource is located on our web server again.
+ * The code of this function has to consider at least the following things
+ * that can happen (possibly combined):
+ * * curl_exec() signals an error,
+ * * 405 Method Not Allowed in response to HEAD request,
+ * * oversize header,
+ * * oversize body in response to GET request,
+ * * HTTP redirection,
+ * * transfer is aborted by this function itself,
+ * * resource is located on an _external_ host,
+ * * redirection points to an external host, but the target resource is
+ *   located on our web server again.
 */
/**
@@ -1110,7 +1131,7 @@ public function scrape($url) {
        if ($config->networkstrain == TOOL_CRAWLER_NETWORKSTRAIN_REASONABLE) {
            $sizelimit = TOOL_CRAWLER_DOWNLOAD_LIMIT;
        } else if ($config->networkstrain == TOOL_CRAWLER_NETWORKSTRAIN_WASTEFUL) {
-            // Always fully download if not aborted by other conditions (like: Content-Length known for non-HTML documents).
+            // Always fully download if not aborted by other conditions (like: Content-Length known for non-HTML documents).
            $sizelimit = -1; // No size limit.
        } else {
            $sizelimit = $config->bigfilesize * 1000000;
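A note on the fallback branch: `bigfilesize` is presumably configured in megabytes, so multiplying by 1000000 yields a byte count in decimal megabytes (not mebibytes) — e.g. a setting of 10 gives a limit of 10000000 octets.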