[links-list] Re: bug? (scanning NUL characters is VERY slow)

Petr Baudis pasky at ucw.cz
Fri Nov 29 11:05:16 PST 2002


Dear diary, on Fri, Nov 29, 2002 at 06:18:22PM CET, I got a letter,
where José Luis González González <jlg80 at mi.madritel.es> told me, that...
> Hi,
> 
> I noticed that both Links and ELinks take a very long time parsing
> files with NUL characters.  The parsing time seems to grow at least in
> proportion to the number of NUL characters.
> 
> It's easy to reproduce:
> 
> $ cat testfile.html
> <html>
> <head>
> <title>Testfile</title>
> </head>
> <body>
> <p>This file includes NUL characters</p>
> $ dd if=/dev/zero bs=1k count=70 >>testfile.html
> $ echo '</body></html>' >>testfile.html
> $ time links -dump testfile.html >/dev/null # This will be very slow
> 
> Since some of you may think an HTML document should never contain them,
> take a look at http://www.joelonsoftware.com/navLinks/fog0000000247.html
> 
> NUL characters should be ignored when scanning, so where does the
> overhead come from?  Are they actually not ignored?

Thanks, fixed the problem. Now the single-NUL case is slightly worse than
before, but the worst case is _dramatically_ better ;-).
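
The reason: the old code called put_chrs() twice and jumped back to the
top of the parsing loop for every single control character, so a run of
N NULs meant N passes through the whole per-character machinery.  The
diff below collapses each run into a single buffer of dots instead, at
the price of one mem_alloc() even for a lone NUL.  Here is a minimal
standalone sketch of that batching pattern; emit() is just a stand-in
for put_chrs() and the buffer walk is simplified, so don't take it as
the real parser interface:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WHITECHAR(c) ((c) == ' ' || ((c) >= 9 && (c) <= 13))

/* emit() is a simplified stand-in for put_chrs(). */
static void emit(const unsigned char *s, int len)
{
	fwrite(s, 1, (size_t) len, stdout);
}

/* Replace each run of control characters with dots, emitted in one
 * call instead of one emit() pair per character. */
static void render(const unsigned char *html, const unsigned char *eof)
{
	const unsigned char *lt = html;

	while (html < eof) {
		if (*html >= ' ' || WHITECHAR(*html)) {
			html++;
			continue;
		}

		/* Flush the plain text gathered before the run. */
		if (html > lt) emit(lt, (int) (html - lt));

		/* Measure the whole run of control characters first... */
		int dotcounter = 0;

		while (html < eof && *html < ' ' && !WHITECHAR(*html)) {
			dotcounter++;
			html++;
		}

		/* ...then emit it as one block of dots. */
		unsigned char *dots = malloc((size_t) dotcounter);

		if (dots) {
			memset(dots, '.', (size_t) dotcounter);
			emit(dots, dotcounter);
			free(dots);
		}
		lt = html;
	}
	if (html > lt) emit(lt, (int) (html - lt));
}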

FYI, the diff is:

Index: parser.c
===================================================================
RCS file: /home/cvs/elinks/elinks/src/document/html/parser.c,v
retrieving revision 1.45
diff -u -r1.45 parser.c
--- parser.c	29 Nov 2002 16:26:13 -0000	1.45
+++ parser.c	29 Nov 2002 19:00:28 -0000
@@ -2532,6 +2532,7 @@
 		int namelen;
 		struct element_info *ei;
 		int inv;
+		int dotcounter = 0;
 
 		if (WHITECHAR(*html) && par_format.align != AL_NO) {
 			unsigned char *h = html;
@@ -2553,7 +2554,8 @@
 			html++;
 			if (!(pos + (html-lt-1))) goto skip_w; /* ??? */
 			if (*(html - 1) == ' ') {
-				if (html < eof && !WHITECHAR(*html)) continue;	/* BIG performance win; not sure if it doesn't cause any bug */
+				/* BIG performance win; not sure if it doesn't cause any bug */
+				if (html < eof && !WHITECHAR(*html)) continue;
 				put_chrs(lt, html - lt, put_chars, f);
 			} else {
 				put_chrs(lt, html - 1 - lt, put_chars, f);
@@ -2592,13 +2594,22 @@
 			}
 		}
 
-		if (*html < ' ') {
+		while (*html < ' ') {
 			/*if (putsp == 1) goto put_sp;
 			putsp = 0;*/
-			put_chrs(lt, html - lt, put_chars, f);
-			put_chrs(".", 1, put_chars, f);
-			html++;
-			goto set_lt;
+			if (html - lt) put_chrs(lt, html - lt, put_chars, f);
+			dotcounter++;
+			html++; lt = html;
+			if (*html >= ' ' || WHITECHAR(*html) || html >= eof) {
+				unsigned char *dots = mem_alloc(dotcounter);
+
+				if (dots) {
+					memset(dots, '.', dotcounter);
+					put_chrs(dots, dotcounter, put_chars, f);
+					mem_free(dots);
+				}
+				goto set_lt;
+			}
 		}
 
 		if (html + 2 <= eof && html[0] == '<' && (html[1] == '!' || html[1] == '?') && !d_opt->plain) {

With a little patience, it should apply without any major problems to
the original Links as well (html.c there).
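
If you want to play with the sketch above, here is a toy driver shaped
like the reporter's test file (again, just an illustration, not part of
the patch):

int main(void)
{
	/* Same shape as the reporter's test: text, 70k of NULs, text. */
	static unsigned char buf[64 + 70 * 1024];
	int len = 0;

	len += sprintf((char *) buf + len, "<p>before</p>");
	memset(buf + len, '\0', 70 * 1024);
	len += 70 * 1024;
	len += sprintf((char *) buf + len, "<p>after</p>");

	/* All 71680 dots come out of a single emit() call. */
	render(buf, buf + len);
	putchar('\n');
	return 0;
}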

-- 
 
				Petr "Pasky" Baudis
.
> I don't know why people still want ACLs. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?
Because People Are Stupid(tm).  Because it's cheaper to put "ACL support: yes"
in the feature list under "Security" than to make sure that userland can cope
with anything more complex than "Me Og.  Og see directory.  Directory Og's.
Nobody change it".  C.f. snake oil, P.T.Barnum and esp. LSM users
        -- Al Viro
.
Crap: http://pasky.ji.cz/